This is an archive of the discontinued LLVM Phabricator instance.


Differential D117900

[AArch64][SVE] Fold gather/scatter with 32bits when possible
ClosedPublic

Authored by CarolineConcatto on Jan 21 2022, 9:05 AM.


Details

Reviewers

efriedma
sdesmalen
david-arm
kmclaughlin

Commits

rG019f0221d52d: [AArch64][SVE] Fold gather/scatter with 32bits when possible

Summary

In AArch64ISelLowering.cpp this patch implements this fold:

GEP (%ptr, (splat(%offset) + stepvector(A)))
into GEP ((%ptr + %offset), stepvector(A))

The above transform simplifies the index operand so that it can be expressed
as i32 elements.
This allows using only one gather/scatter assembly instruction instead of two.
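
As a rough illustration of why the fold above preserves addresses, the sketch below compares per-lane addresses before and after it using plain integers. The function names, the scalar model of a vector and the idea of passing the lane count explicitly are assumptions made for illustration; none of this is LLVM API.

```cpp
#include <cstdint>
#include <vector>

// Before the fold: lane i addresses ptr + (offset + i * step) * scale,
// so every index element needs 64 bits if offset is large.
std::vector<int64_t> addressesBeforeFold(int64_t ptr, int64_t offset,
                                         int64_t step, int64_t scale,
                                         unsigned lanes) {
  std::vector<int64_t> out;
  for (unsigned i = 0; i < lanes; ++i)
    out.push_back(ptr + (offset + static_cast<int64_t>(i) * step) * scale);
  return out;
}

// After the fold: the splat offset is moved into the base pointer and the
// remaining index is a step vector that is stored in 32-bit elements.
// This assumes i * step fits in int32_t for every lane, which is the
// condition the patch checks before applying the fold.
std::vector<int64_t> addressesAfterFold(int64_t ptr, int64_t offset,
                                        int64_t step, int64_t scale,
                                        unsigned lanes) {
  const int64_t newBase = ptr + offset * scale;
  std::vector<int64_t> out;
  for (unsigned i = 0; i < lanes; ++i) {
    const int32_t index32 =
        static_cast<int32_t>(static_cast<int64_t>(i) * step);
    out.push_back(newBase + static_cast<int64_t>(index32) * scale);
  }
  return out;
}
```

Calling both functions with the same arguments, for example a 40-bit offset with a small step, should produce identical address lists even though the second form only ever stores small index values.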

Patch by Paul Walker (@paulwalker-arm).

Depends on D118459

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

CarolineConcatto created this revision. Jan 21 2022, 9:05 AM

Herald added a reviewer: efriedma. Jan 21 2022, 9:05 AM

Herald added subscribers: psnobl, hiraditya, kristof.beyls, tschuett.

CarolineConcatto requested review of this revision. Jan 21 2022, 9:05 AM

Herald added a project: Restricted Project. Jan 21 2022, 9:05 AM

Herald added a subscriber: llvm-commits.

CarolineConcatto edited the summary of this revision. Jan 21 2022, 9:09 AM

CarolineConcatto added reviewers: sdesmalen, david-arm, kmclaughlin.

Harbormaster completed remote builds in B144849: Diff 402008. Jan 21 2022, 10:34 AM

The approach seems right to me, but the code is still a little messy so I mostly left a bunch of suggestions/nits to improve readability.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16371	nit: The word 'Offset' is used quite a few times in this function, which confused me a bit. I think what this is saying is that the eventual instruction will scale the offset (i.e. the offset is not an offset, but an index). So maybe change this to `OffsetIsScaledByInstruction`?
16374–16385	nit: this is equivalent to:
    // Only consider element types that are pointer sized as smaller types can
    // be easily promoted. Also ignore indices that are already type legal.
    EVT IndexVT = Index.getValueType();
    if (IndexVT.getVectorElementType() != MVT::i64 || IndexVT == MVT::nxv2i64)
      return false;
16387	nit: IndexVT isn't used until line 16173, so don't change it here but rather closest to where it's used. Also, I'd suggest storing it in a new variable, named e.g. `NewIndexVT`.
16388	Maybe rename this to `Stride`?
16394	This makes the assumption that STEP_VECTOR is on the right-hand side. There is currently no code that forces this, so you'll want to add a condition to SelectionDAG::getNode() where it reorders the constants to the right, to do something like this:
    -  if ((IsN1C && !IsN2C) || (IsN1CFP && !IsN2CFP))
    +
    +  // Favour step_vector on the RHS because it's a kind of
    +  // constant.
    +  if (((N1.getOpcode() == ISD::STEP_VECTOR || IsN1C) && !IsN2C) ||
    +      (IsN1CFP && !IsN2CFP))
16395	If `Index.getOperand(1)` is ISD::STEP_VECTOR, then its operand must be a constant. There's no need to test for it, you can call `Index.getOperand(0).getConstantOperandVal(0)` directly.
16401	nit: Does `N->getScale()` always return a value even when ScaledOffset == false? If so, and it returns a Constant `1` by default, then you can remove the `if (ScaledOffset)` condition above.
16403	If `StepConst` is only set in this block (or in general, when we know for sure that we can fold this into a simpler addressing mode), then ChangedIndex is redundant, and you can just check that StepConst != 0.
16412	This isn't very easy to read, I'd recommend creating three variables instead:
    if (Index.getOpcode() == ISD::SHL &&
        Index.getOperand(0).getOpcode() == ISD::ADD &&
        Index.getOperand(0).getOperand(1).getOpcode() == ISD::STEP_VECTOR) {
      SDValue ShiftAmount = Index.getOperand(1);
      SDValue BaseOffset = Index.getOperand(0).getOperand(0);
      SDValue Step = Index.getOperand(0).getOperand(1);
      if (auto Shift = DAG.getSplatValue(ShiftAmount))
        if (auto *ShiftC = dyn_cast<ConstantSDNode>(Shift))
          if (auto Offset = DAG.getSplatValue(BaseOffset)) {
            // ... code to calculate Stride, Offset and BasePtr here ...
16434–16439	You can more easily calculate MaxVScale from:
    const auto &Subtarget =
        static_cast<const AArch64Subtarget &>(DAG.getSubtarget());
    unsigned MaxVScale =
        Subtarget.getMaxSVEVectorSizeInBits() / AArch64::SVEBitsPerBlock;
16457	Maybe just move all these definitions after the `if(!DCI.isBeforeLegalize()) return SDValue();` (that is, if you take my suggestion mentioned below).
16465	nit: just bail out early to avoid a level of indentation, e.g.
    if (!DCI.isBeforeLegalize())
      return SDValue();
    if (FindMoreOptimalIndexType(...)) {
      ...
    }
17913–17916	nit:
    case ISD::MGATHER:
    case ISD::MSCATTER:
      return performMaskedGatherScatterCombine(N, DCI, DAG);
?

Matt added a subscriber: Matt. Jan 25 2022, 3:16 PM

This patch now handles only this pattern: GEP (%ptr, (splat(%offset) + stepvector(A)))

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16394	I don't think it is a good solution to add this in SelectionDAG. I could be wrong, but when I apply this suggestion I see regressions in sve-stepvector.ll and sve-intrinsics-index. Instead of only 1 instruction, it becomes 5. For instance:
    define <vscale x 16 x i8> @index_rr_i8(i8 %a, i8 %b) {
     ; CHECK-LABEL: index_rr_i8:
     ; CHECK: // %bb.0:
    -; CHECK-NEXT: index z0.b, w0, w1
    +; CHECK-NEXT: index z1.b, #0, #1
    +; CHECK-NEXT: mov z2.b, w1
    +; CHECK-NEXT: mov z0.b, w0
    +; CHECK-NEXT: ptrue p0.b
    +; CHECK-NEXT: mla z0.b, p0/m, z2.b, z1.b
I could add the check here instead, but as you saw the C++ code here would not be nicer.
16401	As we also discussed, the test `if (ScaledOffset)` could be removed and Offset could always be multiplied by N->getScale(), as it will never be zero.
16403	My worry was that the constant could be zero. But I checked, and in that case the step vector is removed. The code is optimized and the step is removed, for the correct reason: Index = step(const) * splat(0) -> Index = splat(0)
16412	I moved this fold into a second patch; it may be easier to review that way. If not, I will add these extra variables.
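
The exchange above weighs two ways of coping with operand order in a commutative add: canonicalising the order centrally, or accepting either order at the point of use. The sketch below illustrates only the second option on a toy expression tree. `Op`, `Node`, `StepPlusSplat` and `matchStepPlusSplat` are hypothetical names invented for this example; they are not SelectionDAG types and this is not what the patch does.

```cpp
#include <optional>
#include <utility>

// Toy node kinds standing in for ISD::ADD, ISD::STEP_VECTOR and a splat.
enum class Op { Add, StepVector, Splat };

struct Node {
  Op op;
  long value;                 // step for StepVector, scalar for Splat
  const Node *lhs = nullptr;  // operands, only used by Add
  const Node *rhs = nullptr;
};

struct StepPlusSplat {
  long step;
  long splat;
};

// Match step(const) + splat(x) regardless of which side the step is on.
std::optional<StepPlusSplat> matchStepPlusSplat(const Node &n) {
  if (n.op != Op::Add || !n.lhs || !n.rhs)
    return std::nullopt;
  const Node *a = n.lhs;
  const Node *b = n.rhs;
  // Normalise locally: make 'a' the step vector if either operand is one.
  if (a->op != Op::StepVector)
    std::swap(a, b);
  if (a->op != Op::StepVector || b->op != Op::Splat)
    return std::nullopt;
  return StepPlusSplat{a->value, b->value};
}
```

Matching both orders keeps the change local but repeats the swap in every combine that cares, whereas a central canonicalisation lets each combine test a single operand position.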

CarolineConcatto added a child revision: D118345: [AArch64][SVE] Add more folds to make use of gather/scatter with 32-bit indices. Jan 27 2022, 3:25 AM

CarolineConcatto added inline comments. Jan 27 2022, 3:26 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16412	This fold is now here: https://reviews.llvm.org/D118345

Harbormaster completed remote builds in B145970: Diff 403575. Jan 27 2022, 10:01 AM

sdesmalen mentioned this in D118459: [ISEL] Canonicalize STEP_VECTOR to LHS if RHS is a splat. Jan 28 2022, 5:13 AM

Thanks for splitting this patch up, that makes it a bit easier to review!

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16371	This comment related to the variable `ScaledOffset` you had in the previous revision of this patch, but then you found out that variable could be removed, so this comment then became redundant. I think the function should be named `findMoreOptimalIndexType`.
16383	nit: can you clarify this by adding a variable `SDValue StepVector = Index.getOperand(1)`, and then querying: auto *C = cast<ConstantSDNode>(StepVector.getOperand(0))
16383–16385	the operand to a stepvector is always a constant, so it will never return false here.
16386	`Stride` should only be set if `Offset` is a splat value, otherwise the return value is incorrect.
16386	Stride should only be set in the block below (under `if (auto Offset = DAG.getSplatValue(Index.getOperand(0)))` because otherwise the function returns `true` (Stride != 0), but `BasePtr` will be undefined.
16395	This should also bail out early if `Stride == 0`

Rebase the patch under D118459 and address comments

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16386	Very good catch! Thank you Sander

Harbormaster completed remote builds in B146277: Diff 404009. Jan 28 2022, 8:18 AM

sdesmalen added inline comments. Jan 31 2022, 6:57 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16367	nit: functions should start with lower case, i.e. `findMoreOptimalIndexType`.
16371	nit: redundant newline.
16386	s/dyn_cast/cast/
16392–16399	nit: maybe just merge these conditions together?
    if (!Stride || Stride < std::numeric_limits<int32_t>::min() ||
        Stride > std::numeric_limits<int32_t>::max())
      return false;
16437	nit: you can remove one more level of indentation, by writing it like this:
    if (!FindMoreOptimalIndexType(MGS, BasePtr, Index, IndexType, DAG))
      return SDValue();
    if (auto *MGT = ...) {
      ...
      return ...
    } else {
      ...
      return ...
    }
    llvm_unreachable("Unhandled case");
llvm/test/CodeGen/AArch64/sve-gather-scatter-addr-opts.ll
4	nit: can you group the positive test cases together (`scatter_i8_index_offset_minimum` and `scatter_i8_index_offset_maximum`), followed by the negative test cases?
5	This test does not exist in this patch?
38	both positive tests use `i8` as types, can you also use a different type (e.g. i16/i32), so that we can check the scaling happens correctly? (`sxtw #1` or `sxtw #2`)

david-arm mentioned this in D118345: [AArch64][SVE] Add more folds to make use of gather/scatter with 32-bit indices. Jan 31 2022, 7:58 AM

Address Sander's comments

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16392–16399	I prefer keeping one test that checks only for zero and another that checks only for overflow. They both have comments explaining why they return early.

CarolineConcatto edited the summary of this revision. Jan 31 2022, 9:01 AM

In D117900#3279405, @CarolineConcatto wrote:

Rebase the patch under D118459 and address comments

Can you add D118459 as a dependency of this patch (set parent revision)?

Thanks for addressing all my comments, the patch looks good to me with final nits addressed.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16390	nit: s/Earlier returns [...] fold change./Return early because no supported pattern is found./
16410	This comment seems misplaced. I think you can remove it, because the code above where it checks the value of LastElementOffset says as much.
llvm/test/CodeGen/AArch64/sve-gather-scatter-addr-opts.ll
7	All these tests use scatters, although the name suggests it should work for gathers too. Can you also add at least one positive test that uses a llvm.masked.gather?
52	nit: This test still does not exist in this file?

Harbormaster completed remote builds in B146661: Diff 404560. Jan 31 2022, 11:25 AM

Add gather test

The commit message was updated; the problem is that I forgot to run arc diff --verbatim,
so the GUI here was not updated. I will use it next time.

Harbormaster completed remote builds in B146849: Diff 404859. Feb 1 2022, 3:41 AM

sdesmalen accepted this revision. Feb 1 2022, 7:25 AM

This revision is now accepted and ready to land. Feb 1 2022, 7:25 AM

paulwalker-arm added inline comments. Feb 1 2022, 7:44 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16451	Just a flyby comment that rather than this I think the LLVM coding style would suggest removing the else?

Update test gather_i8_index_offset_8

CarolineConcatto marked an inline comment as done. Feb 3 2022, 9:55 AM

-Remove commented line from test gather_i8_index_offset_8

This revision was landed with ongoing or failed builds. Feb 3 2022, 11:00 AM

Closed by commit rG019f0221d52d: [AArch64][SVE] Fold gather/scatter with 32bits when possible (authored by CarolineConcatto).

This revision was automatically updated to reflect the committed changes.

CarolineConcatto added a commit: rG019f0221d52d: [AArch64][SVE] Fold gather/scatter with 32bits when possible.

Harbormaster completed remote builds in B147435: Diff 405695. Feb 3 2022, 11:05 AM

paulwalker-arm added inline comments. Feb 10 2022, 8:28 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16367–16370	Hi @CarolineConcatto, sorry for the too late review comment but I think this function has a serious problem albeit one that isn't going to trigger with its current usage. There are several places where the function returns false after having already changed some of its input reference parameters (i.e. BasePtr). This is unnatural and likely to confuse the unaware. From the original work you'll see that these parameters were only changed when something more optimal is found. With the current format of this function callers will have to backup the parameters, which is just something they're not going to know to do until they've spent time debugging why their "local" state has been trashed. Can you please change the function so that the reference inputs are unchanged when returning false?
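
One common way to meet the request above is to compute into local copies and write back to the reference parameters only once every check has passed. The sketch below shows that shape with plain integers. The `Address` struct, its fields and `tryFoldOffsetIntoBase` are hypothetical stand-ins for the real `BasePtr`/`Index` parameters, and this is not the fix that was eventually committed.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the caller's (BasePtr, Index) state.
struct Address {
  int64_t basePtr;
  int64_t splatOffset;  // splat part of the index
  int64_t stride;       // step of the step-vector part of the index
};

// Returns true and updates 'addr' only if a narrower index is usable.
// On every 'return false' path 'addr' is left exactly as it was passed in.
bool tryFoldOffsetIntoBase(Address &addr, int64_t scale) {
  // Work on a copy so early exits cannot leak partial updates.
  Address candidate = addr;
  candidate.basePtr += candidate.splatOffset * scale;
  candidate.splatOffset = 0;

  if (candidate.stride == 0)
    return false;
  if (candidate.stride < std::numeric_limits<int32_t>::min() ||
      candidate.stride > std::numeric_limits<int32_t>::max())
    return false;

  // All checks passed: commit the new values in one step.
  addr = candidate;
  return true;
}
```

With this structure a caller can pass its own state directly and rely on it being untouched whenever the function reports failure.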

sdesmalen mentioned this in D119728: [AArch64][SVE] Handle more cases in findMoreOptimalIndexType. Feb 14 2022, 8:17 AM

sdesmalen mentioned this in rG201e3686ab45: [AArch64][SVE] Handle more cases in findMoreOptimalIndexType. Feb 28 2022, 4:16 AM

Revision Contents

Path                                                         Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp              93 lines
llvm/test/CodeGen/AArch64/sve-gather-scatter-addr-opts.ll    212 lines

Diff 405715

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp


(883 lines hidden; context: #undef LCALLNAME5)

   setTargetDAGCombine(ISD::INTRINSIC_VOID);
   setTargetDAGCombine(ISD::INTRINSIC_W_CHAIN);
   setTargetDAGCombine(ISD::INSERT_VECTOR_ELT);
   setTargetDAGCombine(ISD::EXTRACT_VECTOR_ELT);
   setTargetDAGCombine(ISD::VECREDUCE_ADD);
   setTargetDAGCombine(ISD::STEP_VECTOR);

+  setTargetDAGCombine(ISD::MGATHER);
+  setTargetDAGCombine(ISD::MSCATTER);

   setTargetDAGCombine(ISD::FP_EXTEND);

   setTargetDAGCombine(ISD::GlobalAddress);

   // In case of strict alignment, avoid an excessive number of byte wide stores.
   MaxStoresPerMemsetOptSize = 8;
   MaxStoresPerMemset =
       Subtarget->requiresStrictAlign() ? MaxStoresPerMemsetOptSize : 32;
(15,453 lines hidden; context: if (Subtarget->supportsAddressTopByteIgnored() &&)
     return SDValue(N, 0);

   if (SDValue Store = foldTruncStoreOfExt(DAG, N))
     return Store;

   return SDValue();
 }

+// Analyse the specified address returning true if a more optimal addressing
+// mode is available. When returning true all parameters are updated to reflect
+// their recommended values.
+static bool findMoreOptimalIndexType(const MaskedGatherScatterSDNode *N,
+                                     SDValue &BasePtr, SDValue &Index,
+                                     ISD::MemIndexType &IndexType,
+                                     SelectionDAG &DAG) {
+  // Only consider element types that are pointer sized as smaller types can
+  // be easily promoted.
+  EVT IndexVT = Index.getValueType();
+  if (IndexVT.getVectorElementType() != MVT::i64 || IndexVT == MVT::nxv2i64)
+    return false;
+
+  int64_t Stride = 0;
+  SDLoc DL(N);
+  // Index = step(const) + splat(offset)
+  if (Index.getOpcode() == ISD::ADD &&
+      Index.getOperand(0).getOpcode() == ISD::STEP_VECTOR) {
+    SDValue StepVector = Index.getOperand(0);
+    if (auto Offset = DAG.getSplatValue(Index.getOperand(1))) {
+      Stride = cast<ConstantSDNode>(StepVector.getOperand(0))->getSExtValue();
+      Offset = DAG.getNode(ISD::MUL, DL, MVT::i64, Offset, N->getScale());
+      BasePtr = DAG.getNode(ISD::ADD, DL, MVT::i64, BasePtr, Offset);
+    }
+  }
+
+  // Return early because no supported pattern is found.
+  if (Stride == 0)
+    return false;
+
+  if (Stride < std::numeric_limits<int32_t>::min() ||
+      Stride > std::numeric_limits<int32_t>::max())
+    return false;
+
+  const auto &Subtarget =
+      static_cast<const AArch64Subtarget &>(DAG.getSubtarget());
+  unsigned MaxVScale =
+      Subtarget.getMaxSVEVectorSizeInBits() / AArch64::SVEBitsPerBlock;
+  int64_t LastElementOffset =
+      IndexVT.getVectorMinNumElements() * Stride * MaxVScale;
+
+  if (LastElementOffset < std::numeric_limits<int32_t>::min() ||
+      LastElementOffset > std::numeric_limits<int32_t>::max())
+    return false;
+
+  EVT NewIndexVT = IndexVT.changeVectorElementType(MVT::i32);
+  Index = DAG.getNode(ISD::STEP_VECTOR, DL, NewIndexVT,
+                      DAG.getTargetConstant(Stride, DL, MVT::i32));
+  return true;
+}
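
The range checks in the function above can be tried out in isolation. The following is a minimal sketch of the same arithmetic on plain integers, assuming 128-bit SVE blocks and taking the maximum vector size as a parameter. `strideFitsIn32BitIndex` and `kSVEBitsPerBlock` are hypothetical names for this example and the function is not part of LLVM.

```cpp
#include <cstdint>
#include <limits>

// Assumption for this sketch: SVE vectors are multiples of 128-bit blocks.
constexpr unsigned kSVEBitsPerBlock = 128;

// Returns true if a step vector with the given stride can be stored in
// 32-bit index elements for the largest vector the target may have.
// minNumElements is the element count at vscale == 1 (e.g. 4 for nxv4i64).
bool strideFitsIn32BitIndex(int64_t stride, unsigned minNumElements,
                            unsigned maxVectorBits) {
  constexpr int64_t lo = std::numeric_limits<int32_t>::min();
  constexpr int64_t hi = std::numeric_limits<int32_t>::max();

  // A zero stride means no step-vector pattern was recognised.
  if (stride == 0)
    return false;
  // The stride itself must be representable as i32.
  if (stride < lo || stride > hi)
    return false;

  // Same product as LastElementOffset in the patch; it cannot overflow
  // int64_t for realistic element counts and vector sizes.
  const int64_t maxVScale = maxVectorBits / kSVEBitsPerBlock;
  const int64_t lastElementOffset =
      static_cast<int64_t>(minNumElements) * stride * maxVScale;
  return lastElementOffset >= lo && lastElementOffset <= hi;
}
```

For a `vscale x 4` index and a 2048-bit maximum vector size, the multiplier is 4 * 16 = 64, so the largest accepted positive stride under these assumptions is 33554431 (64 * 33554431 = 2^31 - 64) and 33554432 is rejected.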

+static SDValue performMaskedGatherScatterCombine(
+    SDNode *N, TargetLowering::DAGCombinerInfo &DCI, SelectionDAG &DAG) {
+  MaskedGatherScatterSDNode *MGS = cast<MaskedGatherScatterSDNode>(N);
+  assert(MGS && "Can only combine gather load or scatter store nodes");
+
+  if (!DCI.isBeforeLegalize())
+    return SDValue();
+
+  SDLoc DL(MGS);
+  SDValue Chain = MGS->getChain();
+  SDValue Scale = MGS->getScale();
+  SDValue Index = MGS->getIndex();
+  SDValue Mask = MGS->getMask();
+  SDValue BasePtr = MGS->getBasePtr();
+  ISD::MemIndexType IndexType = MGS->getIndexType();
+
+  if (!findMoreOptimalIndexType(MGS, BasePtr, Index, IndexType, DAG))
+    return SDValue();
+
+  // Here we catch such cases early and change MGATHER's IndexType to allow
+  // the use of an Index that's more legalisation friendly.
+  if (auto *MGT = dyn_cast<MaskedGatherSDNode>(MGS)) {
+    SDValue PassThru = MGT->getPassThru();
+    SDValue Ops[] = {Chain, PassThru, Mask, BasePtr, Index, Scale};
+    return DAG.getMaskedGather(
+        DAG.getVTList(N->getValueType(0), MVT::Other), MGT->getMemoryVT(), DL,
+        Ops, MGT->getMemOperand(), IndexType, MGT->getExtensionType());
+  }
+  auto *MSC = cast<MaskedScatterSDNode>(MGS);
+  SDValue Data = MSC->getValue();
+  SDValue Ops[] = {Chain, Data, Mask, BasePtr, Index, Scale};
+  return DAG.getMaskedScatter(DAG.getVTList(MVT::Other), MSC->getMemoryVT(), DL,
+                              Ops, MSC->getMemOperand(), IndexType,
+                              MSC->isTruncatingStore());
+}

 /// Target-specific DAG combine function for NEON load/store intrinsics
 /// to merge base address updates.
 static SDValue performNEONPostLDSTCombine(SDNode *N,
                                           TargetLowering::DAGCombinerInfo &DCI,
                                           SelectionDAG &DAG) {
   if (DCI.isBeforeLegalize() || DCI.isCalledByLegalizer())
     return SDValue();

   unsigned AddrOpIdx = N->getNumOperands() - 1;
   SDValue Addr = N->getOperand(AddrOpIdx);

   // Search for a use of the address operand that is an increment.
   for (SDNode::use_iterator UI = Addr.getNode()->use_begin(),
        UE = Addr.getNode()->use_end(); UI != UE; ++UI) {
     SDNode *User = *UI;
     if (User->getOpcode() != ISD::ADD ||
         UI.getUse().getResNo() != Addr.getResNo())
       continue;

     // Check that the add is independent of the load/store. Otherwise, folding
     // it would create a cycle.
     SmallPtrSet<const SDNode *, 32> Visited;
     SmallVector<const SDNode *, 16> Worklist;
(1,431 lines hidden; context: SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N,)
   case ISD::SETCC:
     return performSETCCCombine(N, DAG);
   case ISD::LOAD:
     if (performTBISimplification(N->getOperand(1), DCI, DAG))
       return SDValue(N, 0);
     break;
   case ISD::STORE:
     return performSTORECombine(N, DCI, DAG, Subtarget);
+  case ISD::MGATHER:
+  case ISD::MSCATTER:
+    return performMaskedGatherScatterCombine(N, DCI, DAG);
   case ISD::VECTOR_SPLICE:
     return performSVESpliceCombine(N, DAG);
   case ISD::FP_EXTEND:
     return performFPExtendCombine(N, DAG, DCI, Subtarget);
   case AArch64ISD::BRCOND:
     return performBRCONDCombine(N, DCI, DAG);
   case AArch64ISD::TBNZ:
   case AArch64ISD::TBZ:
     return performTBZCombine(N, DCI, DAG);
(2,286 lines hidden)

llvm/test/CodeGen/AArch64/sve-gather-scatter-addr-opts.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=aarch64-linux-unknown | FileCheck %s


				sdesmalen: nit: can you group the positive test cases together (`scatter_i8_index_offset_minimum` and `scatter_i8_index_offset_maximum`), followed by the negative test cases?
				; Ensure we use a "vscale x 4" wide scatter for the maximum supported offset.
				sdesmalen: This test does not exist in this patch?
				define void @scatter_i8_index_offset_maximum(i8* %base, i64 %offset, <vscale x 4 x i1> %pg, <vscale x 4 x i8> %data) #0 {
				; CHECK-LABEL: scatter_i8_index_offset_maximum:
				sdesmalen: All these tests use scatters, although the name suggests it should work for gathers too. Can you also add at least one positive test that uses a llvm.masked.gather?
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #33554431
				; CHECK-NEXT: add x9, x0, x1
				; CHECK-NEXT: index z1.s, #0, w8
				; CHECK-NEXT: st1b { z0.s }, p0, [x9, z1.s, sxtw]
				; CHECK-NEXT: ret
				%t0 = insertelement <vscale x 4 x i64> undef, i64 %offset, i32 0
				%t1 = shufflevector <vscale x 4 x i64> %t0, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%t2 = insertelement <vscale x 4 x i64> undef, i64 33554431, i32 0
				%t3 = shufflevector <vscale x 4 x i64> %t2, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%step = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				%t4 = mul <vscale x 4 x i64> %t3, %step
				%t5 = add <vscale x 4 x i64> %t1, %t4
				%t6 = getelementptr i8, i8* %base, <vscale x 4 x i64> %t5
				call void @llvm.masked.scatter.nxv4i8(<vscale x 4 x i8> %data, <vscale x 4 x i8*> %t6, i32 2, <vscale x 4 x i1> %pg)
				ret void
				}

				; Ensure we use a "vscale x 4" wide scatter for the minimum supported offset.
				define void @scatter_i16_index_offset_minimum(i16* %base, i64 %offset, <vscale x 4 x i1> %pg, <vscale x 4 x i16> %data) #0 {
				; CHECK-LABEL: scatter_i16_index_offset_minimum:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #-33554432
				; CHECK-NEXT: add x9, x0, x1, lsl #1
				; CHECK-NEXT: index z1.s, #0, w8
				; CHECK-NEXT: st1h { z0.s }, p0, [x9, z1.s, sxtw #1]
				; CHECK-NEXT: ret
				%t0 = insertelement <vscale x 4 x i64> undef, i64 %offset, i32 0
				%t1 = shufflevector <vscale x 4 x i64> %t0, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%t2 = insertelement <vscale x 4 x i64> undef, i64 -33554432, i32 0
				%t3 = shufflevector <vscale x 4 x i64> %t2, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				sdesmalen: Both positive tests use `i8` as types. Can you also use a different type (e.g. i16/i32), so that we can check the scaling happens correctly (`sxtw #1` or `sxtw #2`)?
				%step = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				%t4 = mul <vscale x 4 x i64> %t3, %step
				%t5 = add <vscale x 4 x i64> %t1, %t4
				%t6 = getelementptr i16, i16* %base, <vscale x 4 x i64> %t5
				call void @llvm.masked.scatter.nxv4i16(<vscale x 4 x i16> %data, <vscale x 4 x i16*> %t6, i32 2, <vscale x 4 x i1> %pg)
				ret void
				}

				; Ensure we use a "vscale x 4" gather for an offset in the limits of 32 bits.
				define <vscale x 4 x i8> @gather_i8_index_offset_8(i8* %base, i64 %offset, <vscale x 4 x i1> %pg) #0 {
				; CHECK-LABEL: gather_i8_index_offset_8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: add x8, x0, x1
				; CHECK-NEXT: index z0.s, #0, #1
				sdesmalen: nit: This test still does not exist in this file?
				; CHECK-NEXT: ld1sb { z0.s }, p0/z, [x8, z0.s, sxtw]
				; CHECK-NEXT: ret
				%splat.insert0 = insertelement <vscale x 4 x i64> undef, i64 %offset, i32 0
				%splat0 = shufflevector <vscale x 4 x i64> %splat.insert0, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%step = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				%splat.insert1 = insertelement <vscale x 4 x i64> undef, i64 1, i32 0
				%splat1 = shufflevector <vscale x 4 x i64> %splat.insert1, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%t1 = mul <vscale x 4 x i64> %splat1, %step
				%t2 = add <vscale x 4 x i64> %splat0, %t1
				%t3 = getelementptr i8, i8* %base, <vscale x 4 x i64> %t2
				%load = call <vscale x 4 x i8> @llvm.masked.gather.nxv4i8(<vscale x 4 x i8*> %t3, i32 4, <vscale x 4 x i1> %pg, <vscale x 4 x i8> undef)
				ret <vscale x 4 x i8> %load
				}
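The positive tests above depend on the splat offset being moved out of the index vector and into the scalar base, so that only the step part remains as an i32 index. The following is a minimal sketch of why each lane's address is unchanged, using plain integers in place of pointers. The function names and parameters are hypothetical and do not come from the patch, and the sketch assumes `stride * lane` fits in a signed 32-bit value.

```cpp
#include <cassert>
#include <cstdint>

// Lane address before the fold: base + (offset + stride * lane) * elemSize,
// with the whole index held in a 64-bit element.
// All names here are illustrative and do not appear in the patch.
std::uint64_t addrBeforeFold(std::uint64_t base, std::int64_t offset,
                             std::int64_t stride, std::int64_t lane,
                             std::int64_t elemSize) {
  std::int64_t index = offset + stride * lane;
  return base + static_cast<std::uint64_t>(index * elemSize);
}

// Lane address after the fold: the scalar base absorbs the scaled offset and
// the remaining index is narrowed to 32 bits, then sign-extended again.
std::uint64_t addrAfterFold(std::uint64_t base, std::int64_t offset,
                            std::int64_t stride, std::int64_t lane,
                            std::int64_t elemSize) {
  std::uint64_t newBase = base + static_cast<std::uint64_t>(offset * elemSize);
  // Assumption: stride * lane is representable as int32.
  std::int32_t index32 = static_cast<std::int32_t>(stride * lane);
  std::int64_t index = index32; // models the sign extension (sxtw)
  return newBase + static_cast<std::uint64_t>(index * elemSize);
}

// Compares both forms for up to 64 lanes (4 elements times a vscale of 16),
// using the strides and element sizes from the positive tests.
void checkFoldPreservesAddresses() {
  const std::uint64_t base = 0x100000000ULL;
  const std::int64_t offset = 12345;
  for (std::int64_t lane = 0; lane < 64; ++lane) {
    assert(addrBeforeFold(base, offset, 33554431, lane, 1) ==
           addrAfterFold(base, offset, 33554431, lane, 1));
    assert(addrBeforeFold(base, offset, -33554432, lane, 2) ==
           addrAfterFold(base, offset, -33554432, lane, 2));
  }
}
```

In the i16 test, `add x9, x0, x1, lsl #1` corresponds to `newBase` with an element size of 2, and the `sxtw #1` addressing mode corresponds to the sign-extended, scaled index.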

				;; Negative tests

				; Ensure we don't use a "vscale x 4" scatter. Cannot prove that variable stride
				; will not wrap when shrunk to be i32 based.
				define void @scatter_f16_index_offset_var(half* %base, i64 %offset, i64 %scale, <vscale x 4 x i1> %pg, <vscale x 4 x half> %data) #0 {
				; CHECK-LABEL: scatter_f16_index_offset_var:
				; CHECK: // %bb.0:
				; CHECK-NEXT: index z1.d, #0, #1
				; CHECK-NEXT: mov z3.d, x1
				; CHECK-NEXT: mov z2.d, z1.d
				; CHECK-NEXT: mov z4.d, z3.d
				; CHECK-NEXT: ptrue p1.d
				; CHECK-NEXT: incd z2.d
				; CHECK-NEXT: mla z3.d, p1/m, z1.d, z3.d
				; CHECK-NEXT: mla z4.d, p1/m, z2.d, z4.d
				; CHECK-NEXT: punpklo p1.h, p0.b
				; CHECK-NEXT: uunpklo z1.d, z0.s
				; CHECK-NEXT: punpkhi p0.h, p0.b
				; CHECK-NEXT: uunpkhi z0.d, z0.s
				; CHECK-NEXT: st1h { z1.d }, p1, [x0, z3.d, lsl #1]
				; CHECK-NEXT: st1h { z0.d }, p0, [x0, z4.d, lsl #1]
				; CHECK-NEXT: ret
				%t0 = insertelement <vscale x 4 x i64> undef, i64 %offset, i32 0
				%t1 = shufflevector <vscale x 4 x i64> %t0, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%t2 = insertelement <vscale x 4 x i64> undef, i64 %scale, i32 0
				%t3 = shufflevector <vscale x 4 x i64> %t0, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%step = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				%t4 = mul <vscale x 4 x i64> %t3, %step
				%t5 = add <vscale x 4 x i64> %t1, %t4
				%t6 = getelementptr half, half* %base, <vscale x 4 x i64> %t5
				call void @llvm.masked.scatter.nxv4f16(<vscale x 4 x half> %data, <vscale x 4 x half*> %t6, i32 2, <vscale x 4 x i1> %pg)
				ret void
				}

				; Ensure we don't use a "vscale x 4" wide scatter when the offset is too big.
				define void @scatter_i8_index_offset_maximum_plus_one(i8* %base, i64 %offset, <vscale x 4 x i1> %pg, <vscale x 4 x i8> %data) #0 {
				; CHECK-LABEL: scatter_i8_index_offset_maximum_plus_one:
				; CHECK: // %bb.0:
				; CHECK-NEXT: rdvl x8, #1
				; CHECK-NEXT: mov w9, #67108864
				; CHECK-NEXT: lsr x8, x8, #4
				; CHECK-NEXT: mov z1.d, x1
				; CHECK-NEXT: punpklo p1.h, p0.b
				; CHECK-NEXT: punpkhi p0.h, p0.b
				; CHECK-NEXT: mul x8, x8, x9
				; CHECK-NEXT: mov w9, #33554432
				; CHECK-NEXT: index z2.d, #0, x9
				; CHECK-NEXT: mov z3.d, x8
				; CHECK-NEXT: add z3.d, z2.d, z3.d
				; CHECK-NEXT: add z2.d, z2.d, z1.d
				; CHECK-NEXT: add z1.d, z3.d, z1.d
				; CHECK-NEXT: uunpklo z3.d, z0.s
				; CHECK-NEXT: uunpkhi z0.d, z0.s
				; CHECK-NEXT: st1b { z3.d }, p1, [x0, z2.d]
				; CHECK-NEXT: st1b { z0.d }, p0, [x0, z1.d]
				; CHECK-NEXT: ret
				%t0 = insertelement <vscale x 4 x i64> undef, i64 %offset, i32 0
				%t1 = shufflevector <vscale x 4 x i64> %t0, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%t2 = insertelement <vscale x 4 x i64> undef, i64 33554432, i32 0
				%t3 = shufflevector <vscale x 4 x i64> %t2, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%step = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				%t4 = mul <vscale x 4 x i64> %t3, %step
				%t5 = add <vscale x 4 x i64> %t1, %t4
				%t6 = getelementptr i8, i8* %base, <vscale x 4 x i64> %t5
				call void @llvm.masked.scatter.nxv4i8(<vscale x 4 x i8> %data, <vscale x 4 x i8*> %t6, i32 2, <vscale x 4 x i1> %pg)
				ret void
				}

				; Ensure we don't use a "vscale x 4" wide scatter when the offset is too small.
				define void @scatter_i8_index_offset_minimum_minus_one(i8* %base, i64 %offset, <vscale x 4 x i1> %pg, <vscale x 4 x i8> %data) #0 {
				; CHECK-LABEL: scatter_i8_index_offset_minimum_minus_one:
				; CHECK: // %bb.0:
				; CHECK-NEXT: rdvl x8, #1
				; CHECK-NEXT: mov x9, #-2
				; CHECK-NEXT: lsr x8, x8, #4
				; CHECK-NEXT: movk x9, #64511, lsl #16
				; CHECK-NEXT: mov z1.d, x1
				; CHECK-NEXT: punpklo p1.h, p0.b
				; CHECK-NEXT: mul x8, x8, x9
				; CHECK-NEXT: mov x9, #-33554433
				; CHECK-NEXT: punpkhi p0.h, p0.b
				; CHECK-NEXT: index z2.d, #0, x9
				; CHECK-NEXT: mov z3.d, x8
				; CHECK-NEXT: add z3.d, z2.d, z3.d
				; CHECK-NEXT: add z2.d, z2.d, z1.d
				; CHECK-NEXT: add z1.d, z3.d, z1.d
				; CHECK-NEXT: uunpklo z3.d, z0.s
				; CHECK-NEXT: uunpkhi z0.d, z0.s
				; CHECK-NEXT: st1b { z3.d }, p1, [x0, z2.d]
				; CHECK-NEXT: st1b { z0.d }, p0, [x0, z1.d]
				; CHECK-NEXT: ret
				%t0 = insertelement <vscale x 4 x i64> undef, i64 %offset, i32 0
				%t1 = shufflevector <vscale x 4 x i64> %t0, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%t2 = insertelement <vscale x 4 x i64> undef, i64 -33554433, i32 0
				%t3 = shufflevector <vscale x 4 x i64> %t2, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%step = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				%t4 = mul <vscale x 4 x i64> %t3, %step
				%t5 = add <vscale x 4 x i64> %t1, %t4
				%t6 = getelementptr i8, i8* %base, <vscale x 4 x i64> %t5
				call void @llvm.masked.scatter.nxv4i8(<vscale x 4 x i8> %data, <vscale x 4 x i8*> %t6, i32 2, <vscale x 4 x i1> %pg)
				ret void
				}

				; Ensure we don't use a "vscale x 4" wide scatter when the stride is too big.
				define void @scatter_i8_index_stride_too_big(i8* %base, i64 %offset, <vscale x 4 x i1> %pg, <vscale x 4 x i8> %data) #0 {
				; CHECK-LABEL: scatter_i8_index_stride_too_big:
				; CHECK: // %bb.0:
				; CHECK-NEXT: rdvl x8, #1
				; CHECK-NEXT: mov x9, #-9223372036854775808
				; CHECK-NEXT: lsr x8, x8, #4
				; CHECK-NEXT: mov z1.d, x1
				; CHECK-NEXT: punpklo p1.h, p0.b
				; CHECK-NEXT: punpkhi p0.h, p0.b
				; CHECK-NEXT: mul x8, x8, x9
				; CHECK-NEXT: mov x9, #4611686018427387904
				; CHECK-NEXT: index z2.d, #0, x9
				; CHECK-NEXT: mov z3.d, x8
				; CHECK-NEXT: add z3.d, z2.d, z3.d
				; CHECK-NEXT: add z2.d, z2.d, z1.d
				; CHECK-NEXT: add z1.d, z3.d, z1.d
				; CHECK-NEXT: uunpklo z3.d, z0.s
				; CHECK-NEXT: uunpkhi z0.d, z0.s
				; CHECK-NEXT: st1b { z3.d }, p1, [x0, z2.d]
				; CHECK-NEXT: st1b { z0.d }, p0, [x0, z1.d]
				; CHECK-NEXT: ret
				%t0 = insertelement <vscale x 4 x i64> undef, i64 %offset, i32 0
				%t1 = shufflevector <vscale x 4 x i64> %t0, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%t2 = insertelement <vscale x 4 x i64> undef, i64 4611686018427387904, i32 0
				%t3 = shufflevector <vscale x 4 x i64> %t2, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%step = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				%t4 = mul <vscale x 4 x i64> %t3, %step
				%t5 = add <vscale x 4 x i64> %t1, %t4
				%t6 = getelementptr i8, i8* %base, <vscale x 4 x i64> %t5
				call void @llvm.masked.scatter.nxv4i8(<vscale x 4 x i8> %data, <vscale x 4 x i8*> %t6, i32 2, <vscale x 4 x i1> %pg)
				ret void
				}


				attributes #0 = { "target-features"="+sve" vscale_range(1, 16) }


				declare <vscale x 4 x i8> @llvm.masked.gather.nxv4i8(<vscale x 4 x i8*>, i32, <vscale x 4 x i1>, <vscale x 4 x i8>)
				declare void @llvm.masked.scatter.nxv4i8(<vscale x 4 x i8>, <vscale x 4 x i8*>, i32, <vscale x 4 x i1>)
				declare void @llvm.masked.scatter.nxv4i16(<vscale x 4 x i16>, <vscale x 4 x i16*>, i32, <vscale x 4 x i1>)
				declare void @llvm.masked.scatter.nxv4f16(<vscale x 4 x half>, <vscale x 4 x half*>, i32, <vscale x 4 x i1>)
				declare <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
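The boundary constants in the tests (33554431, -33554432 and their neighbours) follow from the `vscale_range(1, 16)` attribute and the `<vscale x 4 x ...>` types: 4 × 16 = 64, and 2^31 / 64 = 33554432. The check that decides whether the index can shrink to i32 is not shown in this section, so the following is only a sketch that assumes a conservative bound of stride × minimum element count × maximum vscale. The function and parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical model of the bound the tests exercise; not the patch's code.
// Returns true when every index element is assumed to fit in a signed i32.
bool strideFitsInI32Index(std::int64_t stride, std::int64_t minNumElts,
                          std::int64_t maxVScale) {
  const std::int64_t lo = std::numeric_limits<std::int32_t>::min();
  const std::int64_t hi = std::numeric_limits<std::int32_t>::max();

  // Reject out-of-range strides first so the multiplication below cannot
  // overflow int64 (assuming minNumElts and maxVScale are small).
  if (stride < lo || stride > hi)
    return false;

  // Conservative bound on the last element's index.
  std::int64_t bound = stride * minNumElts * maxVScale;
  return bound >= lo && bound <= hi;
}

// The five constant strides used by the tests, with 4 elements and vscale <= 16.
void checkTestBoundaries() {
  assert(strideFitsInI32Index(33554431, 4, 16));   // offset_maximum
  assert(strideFitsInI32Index(-33554432, 4, 16));  // offset_minimum
  assert(!strideFitsInI32Index(33554432, 4, 16));  // maximum_plus_one
  assert(!strideFitsInI32Index(-33554433, 4, 16)); // minimum_minus_one
  assert(!strideFitsInI32Index(4611686018427387904LL, 4, 16)); // stride_too_big
}
```

Under this assumption the accepted strides are exactly those for which 64 × stride stays within [-2^31, 2^31 - 1], which matches which tests emit a single `.s` scatter and which fall back to two `.d` scatters.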