This is an archive of the discontinued LLVM Phabricator instance.

Differential D111221

[AArch64][SVE] Improve code generation for VLS i1 masks
ClosedPublic

Authored by DavidTruby on Oct 6 2021, 5:33 AM.

Details

Reviewers

efriedma
paulwalker-arm
peterwaller-arm
MattDevereau
bsmith

Commits

rG7e44eb079d99: [AArch64][SVE] Improve code generation for VLS i1 masks

Summary

This patch partially resolves an issue for VLS code generation
where a mask is generated from a smaller width integer comparison
than the instruction using the mask requires.

Instead of sign extending a p register by converting it to a z
register, extending that, and converting back, we instead just
do an unpack of the p register.

A separate issue causes the code generation to still be poor when
the mask generation would fit in a NEON register, as we then use
a NEON comparison operation and have to convert that to a p register.
This will be resolved in a separate patch.
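
To make the replacement described in the summary concrete, here is a minimal lane-level sketch in standalone C++ (not LLVM code). The helpers `movAllOnes`, `sunpklo`, `cmpneZero` and `punpklo` are hypothetical models of the instructions of the same name, a predicate is modelled as one bool per lane, and the governing `ptrue` of the final compare is assumed to cover every lane of the widened vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Pred = std::vector<bool>; // one flag per lane (assumed model)
using VecI8 = std::vector<int8_t>;
using VecI16 = std::vector<int16_t>;

// Model of `mov z.b, p/z, #-1`: active lanes become -1, inactive lanes 0.
VecI8 movAllOnes(const Pred &P) {
  VecI8 Z(P.size(), 0);
  for (std::size_t I = 0; I < P.size(); ++I)
    if (P[I])
      Z[I] = -1;
  return Z;
}

// Model of `sunpklo z.h, z.b`: sign-extend the low half of the lanes.
VecI16 sunpklo(const VecI8 &Z) {
  assert(Z.size() % 2 == 0 && "lane count must be even");
  VecI16 R(Z.size() / 2);
  for (std::size_t I = 0; I < R.size(); ++I)
    R[I] = static_cast<int16_t>(Z[I]);
  return R;
}

// Model of `cmpne p.h, pg/z, z.h, #0` with an all-active governing predicate.
Pred cmpneZero(const VecI16 &Z) {
  Pred P(Z.size());
  for (std::size_t I = 0; I < Z.size(); ++I)
    P[I] = Z[I] != 0;
  return P;
}

// Model of `punpklo p.h, p.b`: keep the low half of the predicate lanes.
Pred punpklo(const Pred &P) {
  return Pred(P.begin(), P.begin() + static_cast<std::ptrdiff_t>(P.size() / 2));
}

// Old sequence: predicate -> vector -> widened vector -> predicate.
Pred widenViaVector(const Pred &P) {
  return cmpneZero(sunpklo(movAllOnes(P)));
}
```

Under these assumptions `widenViaVector(P)` and `punpklo(P)` give the same lanes for any predicate `P`, which is the equivalence the patch relies on.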

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

DavidTruby created this revision. Oct 6 2021, 5:33 AM

Herald added a reviewer: efriedma. Oct 6 2021, 5:33 AM

Herald added subscribers: psnobl, hiraditya, kristof.beyls, tschuett.

DavidTruby requested review of this revision. Oct 6 2021, 5:33 AM

Herald added a project: Restricted Project. Oct 6 2021, 5:33 AM

Herald added a subscriber: llvm-commits.

DavidTruby added reviewers: paulwalker-arm, peterwaller-arm, MattDevereau, bsmith. Oct 6 2021, 5:35 AM

Harbormaster completed remote builds in B127268: Diff 377509. Oct 6 2021, 6:05 AM

peterwaller-arm added inline comments. Oct 6 2021, 6:24 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15344	nit. As many of the other combines have, please could you introduce a comment showing the form of the intended combine?
16036	nit. As many of the other combines have, please could you introduce a comment showing the form of the combine?

paulwalker-arm added inline comments. Oct 6 2021, 9:49 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16044	I believe this transform is sound for a normal `SETCC` but `N` is a `SETCC_MERGE_ZERO` and so consideration must be paid to the predicate to ensure the inactive lanes get zero'd. The easiest way to make the combine safe is to check `Pred` is all active. The problem with doing that is it'll probably mean the combine doesn't fire for the case you care about.
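
A small standalone C++ model may help show why the governing predicate matters here. It is only a sketch, not LLVM code: `setccMergeZeroNE` is a hypothetical lane-level stand-in for a `SETCC_MERGE_ZERO` with condition `ne` against zero, `foldIsSound` is a hypothetical helper, and predicates are modelled as one bool per lane.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Pred = std::vector<bool>; // one flag per lane (assumed model)

// Hypothetical model of SETCC_MERGE_ZERO(Pg, X, 0, ne): lanes where Pg is
// inactive are forced to false, active lanes hold X[i] != 0.
Pred setccMergeZeroNE(const Pred &Pg, const std::vector<int16_t> &X) {
  assert(Pg.size() == X.size() && "predicate and data must have equal lanes");
  Pred R(X.size(), false);
  for (std::size_t I = 0; I < X.size(); ++I)
    R[I] = Pg[I] && X[I] != 0;
  return R;
}

// Sign-extend an i1 predicate to i16 lanes: true -> -1, false -> 0.
std::vector<int16_t> sextPred(const Pred &P) {
  std::vector<int16_t> Z(P.size());
  for (std::size_t I = 0; I < P.size(); ++I)
    Z[I] = P[I] ? -1 : 0;
  return Z;
}

// True when folding setcc_merge_zero(Pg, sext(A), 0, ne) to plain A keeps
// every lane unchanged, i.e. A has no active lane outside Pg.
bool foldIsSound(const Pred &Pg, const Pred &A) {
  return setccMergeZeroNE(Pg, sextPred(A)) == A;
}
```

In this model the fold is safe when `Pg` is all active, and it changes the result as soon as `A` has an active lane that `Pg` masks off, which is the case the comment asks the combine to rule out.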

Matt added a subscriber: Matt. Oct 6 2021, 10:45 AM

Added an extra condition to make the transformation sound.

In essence, we need to check that the predicate is the same
before and after the transform, modulo its size since we're
moving from a smaller to larger predicate.

DavidTruby marked an inline comment as done. Oct 12 2021, 5:40 AM

Harbormaster completed remote builds in B128341: Diff 378986. Oct 12 2021, 6:49 AM

bsmith added inline comments. Oct 25 2021, 3:34 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
36	Rogue include?
16046	Is this assuming that Pred and OrigPred are ptrues? If they're not, won't this break?

Add additional check that predicates are ptrues

DavidTruby marked 2 inline comments as done. Oct 27 2021, 5:57 AM

Harbormaster completed remote builds in B130933: Diff 382634. Oct 27 2021, 7:27 AM

Add negative tests for the case where the ptrue vl is not the same, or where
one of the predicates is not a ptrue at all.

Harbormaster completed remote builds in B134728: Diff 387924. Nov 17 2021, 6:51 AM

peterwaller-arm added inline comments. Nov 18 2021, 6:19 AM

llvm/test/CodeGen/AArch64/sve-punpklo-combine.ll
28 ↗	(On Diff #387924)	ptrue

Improve added ACLE intrinsic tests

Fix spelling mistake

Fix tests after spelling mistake (whoops...)

Harbormaster completed remote builds in B135417: Diff 388904. Nov 22 2021, 12:09 PM

peterwaller-arm added inline comments. Nov 23 2021, 7:00 AM

llvm/test/CodeGen/AArch64/sve-punpklo-combine.ll
18 ↗	(On Diff #388904)	Unused %dup? (I guess you replaced it with zeroinitializer?)

paulwalker-arm added inline comments. Nov 23 2021, 10:22 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15344	This comment doesn't match the code below. At the DAG level this is really `sunpklo(sext(pred)) -> sext(extract_low_half(pred))`. Also, is this transform universally good? As in, if the later transformation does not occur, is punpklo always better than sunpklo?
15352	Can you be explicit here given we know we always want a sign extend?
16045	Perhaps I've confused myself with the number of `getOperand()` calls but this doesn't look quite right:
	LHS == ISD::SIGN_EXTEND
	LHS->getOperand(0) == ISD::EXTRACT_SUBVECTOR
	LHS->getOperand(0)->getOperand(0) == vector we're extracting the subvector from
	LHS->getOperand(0)->getOperand(0)->getOperand(0) == ????
	and how do you know LHS->getOperand(0)->getOperand(0) has an operand to get? I can see the following code then ensures `LHS->getOperand(0)->getOperand(0)->getOperand(0)` is a `PTRUE` but before then depending on `LHS->getOperand(0)->getOperand(0)` the extra call to `getOperand(0)` might assert.
	Also, is it really the case that it doesn't matter how the `PTRUE` is processed before being passed to `ISD::EXTRACT_SUBVECTOR`?
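
The concern about unchecked `getOperand(0)` calls can be illustrated with a toy graph in standalone C++. This is a sketch under stated assumptions: `Node`, `Opcode`, `operandIf` and `findInnerPTrue` are hypothetical stand-ins and are not `llvm::SDNode` or any LLVM API. The point is only that each step checks the opcode and operand count before indexing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a DAG node; not llvm::SDNode.
enum class Opcode { SignExtend, ExtractSubvector, SetccMergeZero, PTrue, Other };

struct Node {
  Opcode Op;
  std::vector<const Node *> Operands;
};

// Return operand Idx of N if N has the expected opcode and enough operands,
// otherwise nullptr. Checking before indexing avoids reading an operand
// that does not exist.
const Node *operandIf(const Node *N, Opcode Expected, std::size_t Idx) {
  if (!N || N->Op != Expected || Idx >= N->Operands.size())
    return nullptr;
  return N->Operands[Idx];
}

// Walk sign_extend -> extract_subvector -> setcc_merge_zero -> predicate
// and return the predicate only if it is a PTRUE.
const Node *findInnerPTrue(const Node *LHS) {
  assert(LHS && "caller must pass a node");
  const Node *Extract = operandIf(LHS, Opcode::SignExtend, 0);
  const Node *InnerSetCC = operandIf(Extract, Opcode::ExtractSubvector, 0);
  const Node *InnerPred = operandIf(InnerSetCC, Opcode::SetccMergeZero, 0);
  return (InnerPred && InnerPred->Op == Opcode::PTrue) ? InnerPred : nullptr;
}
```

If the node feeding the extract is a leaf or has a different opcode, the walk stops with `nullptr` instead of indexing into an empty operand list.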

Refactor SetCC combine and rebase

replace getSextOrTrunc with getNode(SIGN_EXTEND)

Update comments

DavidTruby marked 3 inline comments as done. Dec 9 2021, 5:35 AM

DavidTruby added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15344	I'm not sure that the transform is universally good, however it does significantly simplify the following transform. This transform has the effect of moving the setcc that converts from an i1 register to another register after the sign_extend, which becomes just a single sign extend rather than a chain of unpacks of unknown length. This is a much easier pattern to recognize later.
16045	I've refactored all of this to make it a bit clearer what's going on, and added an additional check so that we can make sure that `getOperand(0)` actually exists.
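
The effect described in the reply on line 15344, a chain of unpacks collapsing into one sign extend, can be checked on a small standalone C++ model. This is a sketch and not LLVM code: `sextPred`, `sunpklo`, `extractLowHalf`, `unpackChain` and `singleSext` are hypothetical helpers, and a predicate is modelled as one bool per lane.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Pred = std::vector<bool>; // one i1 flag per lane (assumed model)

// sign_extend of an i1 predicate: true -> -1, false -> 0.
template <typename T> std::vector<T> sextPred(const Pred &P) {
  std::vector<T> Z(P.size());
  for (std::size_t I = 0; I < P.size(); ++I)
    Z[I] = P[I] ? T(-1) : T(0);
  return Z;
}

// sunpklo: sign-extend the low half of the lanes to a wider element type.
template <typename To, typename From>
std::vector<To> sunpklo(const std::vector<From> &Z) {
  assert(Z.size() % 2 == 0 && "lane count must be even");
  std::vector<To> R(Z.size() / 2);
  for (std::size_t I = 0; I < R.size(); ++I)
    R[I] = static_cast<To>(Z[I]);
  return R;
}

// extract_subvector at index 0 with half the lanes.
Pred extractLowHalf(const Pred &P) {
  return Pred(P.begin(), P.begin() + static_cast<std::ptrdiff_t>(P.size() / 2));
}

// Before: one sign extend followed by a chain of two vector unpacks.
std::vector<int32_t> unpackChain(const Pred &P) {
  return sunpklo<int32_t>(sunpklo<int16_t>(sextPred<int8_t>(P)));
}

// After: the extracts act on the predicate and a single sign extend remains.
std::vector<int32_t> singleSext(const Pred &P) {
  return sextPred<int32_t>(extractLowHalf(extractLowHalf(P)));
}
```

Applying the rewrite once per `sunpklo` moves each extract onto the predicate, so however long the chain is, the later combine only has to match a single sign extend.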

Harbormaster completed remote builds in B138421: Diff 393126. Dec 9 2021, 6:24 AM

peterwaller-arm added inline comments. Dec 9 2021, 8:47 AM

llvm/test/CodeGen/AArch64/sve-punpklo-combine.ll
224 ↗	(On Diff #393126)	I believe these should be spelled "v2i16", etc?

Fix up acle tests

DavidTruby marked 3 inline comments as done. Dec 13 2021, 5:11 AM

DavidTruby marked 2 inline comments as done. Dec 13 2021, 5:31 AM

Harbormaster completed remote builds in B138943: Diff 393855. Dec 13 2021, 5:50 AM

peterwaller-arm accepted this revision. Dec 13 2021, 7:03 AM

This revision is now accepted and ready to land. Dec 13 2021, 7:03 AM

I've not hit the "Request Changes" button just in case I'm wrong but please don't commit the patch until my AArch64SVEPredPattern::all query is resolved one way or the other.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15344	Sounds reasonable to me. Given it's not clear cut perhaps it's worth adding a line after `// sunpklo(...` to highlight this combine works in partnership with `performSetCCPunpkCombine`?
16031	Is this required? I'm thinking that because the return type of `AArch64ISD::SETCC_MERGE_ZERO` is always a scalable `i1` vector, the previous `Extract->getValueType(0) != N->getValueType(0)` already has you covered here.
16032	FYI: I wasn't going to say this and you may well want to ignore this in the interest of time but I cannot convince myself this restriction is necessary. I say this because regardless of its value the result will be a sequence of `PUNPK` instructions which should all do what we want. The downside is that we'll need more test coverage and given this code is a special case it's probably best to leave as is.
16039–16041	This comment doesn't match the code. The `punpklo` part should be `extract_subvector`? It's also worth copying this as a function comment because without it you need to jump here first before you can understand why the code above exists. Perhaps with a starting statement of "Remove redundant predicate trunc(sext()) sequences." Actually if you do that then I'm wondering whether words here might better describe what is going on. I'm thinking something like: By this point we've effectively got zero_inactive_lanes_and_trunc_i1(sext_i1(A)). If we can prove A's inactive lanes are already zero then the trunc(sext()) sequence is redundant and we can operate on A directly.
16042	Up to you but given you have `InnerSetCC` perhaps this should be `InnerPred`?
16045	Is this enough? I'm not totally sure but my initial reaction is "If this is `AArch64SVEPredPattern::all` then that means the number of active lanes for `InnerSetCC` will be different to `N`", so I'm thinking this needs to be restricted to the cases where a specific vector length pattern is used?
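
The lane counts behind the `AArch64SVEPredPattern::all` question can be sketched in standalone C++. The helpers `activeLanesAll` and `activeLanesVL` are hypothetical, and the model assumes a `vlN` pattern activates exactly N lanes when the vector has at least N elements and none otherwise. It only illustrates how many lanes each `ptrue` form activates and does not by itself decide whether the combine would be unsound with `all`.

```cpp
#include <cassert>

// Lanes activated by `ptrue` with the `all` pattern: every element of the
// vector, so the count depends on both vector length and element size.
unsigned activeLanesAll(unsigned VectorBits, unsigned ElementBits) {
  assert(ElementBits != 0 && VectorBits % ElementBits == 0);
  return VectorBits / ElementBits;
}

// Lanes activated by `ptrue` with a fixed `vlN` pattern: N lanes if the
// vector holds at least N elements, otherwise none (assumed model).
unsigned activeLanesVL(unsigned VectorBits, unsigned ElementBits, unsigned N) {
  return N <= activeLanesAll(VectorBits, ElementBits) ? N : 0;
}
```

For a 512-bit vector, `vl32` activates 32 lanes for both byte and halfword elements, whereas `all` activates 64 byte lanes but only 32 halfword lanes, which is the mismatch the comment points at.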

Add additional check that ptrues are not vl_all, and associated tests.

Improve comments describing what the punpk combine does in combination with
the other combine.

Ensure ptrue has a fixed vector length

Harbormaster completed remote builds in B139410: Diff 394523. Dec 15 2021, 4:43 AM

paulwalker-arm added inline comments. Dec 17 2021, 7:32 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15354–15356	This can be just `return DAG.getNode(ISD::SIGN_EXTEND...`
16021	To be consistent, either the above `punpklo` needs to be `extract_subvector` or this needs to be `punpklo`.
16031	If you go with `punpklo` above then can you please add something like `// punpklo == extract_subvector` here.
llvm/test/CodeGen/AArch64/sve-punpklo-combine.ll
19–20 ↗	(On Diff #394523)	I think you can simplify all the tests in this file by removing the extending load part and returning the comparison result directly. I say this because the new DAG combines don't care about them and so they shouldn't be necessary to exercise them.
242 ↗	(On Diff #394523)	From the function name alone it's not obvious this is a negative test. Please add a small comment that highlights this is a negative test and why the optimisation is not applicable. I've put the comment here but it's a general one applicable to the other functions.

Add comments to negative tests explaining why the optimisation can't apply

Directly return predicate from tests rather than adding an extending load

paulwalker-arm accepted this revision. Dec 17 2021, 8:15 AM

paulwalker-arm added inline comments.

llvm/test/CodeGen/AArch64/sve-punpklo-combine.ll
246–251 ↗	(On Diff #395139)	These can be removed also?

This revision was landed with ongoing or failed builds. Dec 17 2021, 8:44 AM

Closed by commit rG7e44eb079d99: [AArch64][SVE] Improve code generation for VLS i1 masks (authored by DavidTruby).

This revision was automatically updated to reflect the committed changes.

DavidTruby added a commit: rG7e44eb079d99: [AArch64][SVE] Improve code generation for VLS i1 masks.

Harbormaster completed remote builds in B139851: Diff 395139. Dec 17 2021, 8:58 AM

Revision Contents

Path                                                        Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp             34 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-masked-loads.ll  263 lines

Diff 382634

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

@@ 27 lines not shown @@
 #include "llvm/ADT/Statistic.h"
 #include "llvm/ADT/StringRef.h"
 #include "llvm/ADT/Triple.h"
 #include "llvm/ADT/Twine.h"
 #include "llvm/Analysis/ObjCARCUtil.h"
 #include "llvm/Analysis/VectorUtils.h"
 #include "llvm/CodeGen/Analysis.h"
 #include "llvm/CodeGen/CallingConvLower.h"
 #include "llvm/CodeGen/MachineBasicBlock.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineInstr.h"
 #include "llvm/CodeGen/MachineInstrBuilder.h"
 #include "llvm/CodeGen/MachineMemOperand.h"
 #include "llvm/CodeGen/MachineRegisterInfo.h"
 #include "llvm/CodeGen/RuntimeLibcalls.h"
 #include "llvm/CodeGen/SelectionDAG.h"
@@ 15,290 lines not shown @@ static SDValue performVectorShiftCombine(SDNode *N,
   APInt DemandedMask = ~ShiftedOutBits;
 
   if (TLI.SimplifyDemandedBits(Op, DemandedMask, DCI))
     return SDValue(N, 0);
 
   return SDValue();
 }
 
+static SDValue performSunpkloCombine(SDNode *N, SelectionDAG &DAG) {
+  // sunpklo (mov z, p/z, -1) => mov z, (punpklo p), -1
+  if (N->getOperand(0).getOpcode() == ISD::SIGN_EXTEND &&
+      N->getOperand(0)->getOperand(0)->getValueType(0).getScalarType() ==
+          MVT::i1) {
+    SDValue CC = N->getOperand(0)->getOperand(0);
+    auto VT = CC->getValueType(0).getHalfNumVectorElementsVT(*DAG.getContext());
+    SDValue Unpk = DAG.getNode(ISD::EXTRACT_SUBVECTOR, SDLoc(N), VT, CC,
+                               DAG.getVectorIdxConstant(0, SDLoc(N)));
+    SDValue Sext = DAG.getSExtOrTrunc(Unpk, SDLoc(N), N->getValueType(0));
+    return Sext;
+  }
+
+  return SDValue();
+}
+
 /// Target-specific DAG combine function for post-increment LD1 (lane) and
 /// post-increment LD1R.
 static SDValue performPostLD1Combine(SDNode *N,
                                      TargetLowering::DAGCombinerInfo &DCI,
                                      bool IsLaneOp) {
   if (DCI.isBeforeLegalizeOps())
     return SDValue();

@@ 646 lines not shown @@ static SDValue performSETCCCombine(SDNode *N, SelectionDAG &DAG) {
   }
 
   return SDValue();
 }
 
 static SDValue performSetccMergeZeroCombine(SDNode *N, SelectionDAG &DAG) {
   assert(N->getOpcode() == AArch64ISD::SETCC_MERGE_ZERO &&
          "Unexpected opcode!");
 
   SDValue Pred = N->getOperand(0);
   SDValue LHS = N->getOperand(1);
   SDValue RHS = N->getOperand(2);
   ISD::CondCode Cond = cast<CondCodeSDNode>(N->getOperand(3))->get();
 
   // setcc_merge_zero pred (sign_extend (setcc_merge_zero ... pred ...)), 0, ne
   // => inner setcc_merge_zero
   if (Cond == ISD::SETNE && isZerosVector(RHS.getNode()) &&
       LHS->getOpcode() == ISD::SIGN_EXTEND &&
       LHS->getOperand(0)->getValueType(0) == N->getValueType(0) &&
       LHS->getOperand(0)->getOpcode() == AArch64ISD::SETCC_MERGE_ZERO &&
       LHS->getOperand(0)->getOperand(0) == Pred)
     return LHS->getOperand(0);
 
+  // setcc_merge_zero pred
+  // (sign_extend (punpklo (setcc_merge_zero ... pred ...))), 0, ne
+  // => punpklo (inner setcc_merge_zero)
+  if (Cond == ISD::SETNE && isZerosVector(RHS.getNode()) &&
+      LHS->getOpcode() == ISD::SIGN_EXTEND &&
+      LHS->getOperand(0)->getValueType(0) == N->getValueType(0) &&
+      LHS->getOperand(0)->getOpcode() == ISD::EXTRACT_SUBVECTOR &&
+      LHS->getOperand(0).getValueType().getScalarType() == MVT::i1 &&
+      LHS->getOperand(0)->getConstantOperandVal(1) == 0) {
+    auto OrigPred = LHS->getOperand(0)->getOperand(0)->getOperand(0);
+    if (Pred.getOpcode() == AArch64ISD::PTRUE &&
+        OrigPred.getOpcode() == AArch64ISD::PTRUE &&
+        Pred.getConstantOperandVal(0) == OrigPred.getConstantOperandVal(0))
+      return LHS->getOperand(0);
+  }
+
   return SDValue();
 }
 
 // Optimize some simple tbz/tbnz cases. Returns the new operand and bit to test
 // as well as whether the test should be inverted. This code is required to
 // catch these cases (as opposed to standard dag combines) because
 // AArch64ISD::TBZ is matched during legalization.
 static SDValue getTestBitOperand(SDValue Op, unsigned &Bit, bool &Invert,
@@ 890 lines not shown @@ SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N,
   case AArch64ISD::GLD1S_SXTW_MERGE_ZERO:
   case AArch64ISD::GLD1S_UXTW_SCALED_MERGE_ZERO:
   case AArch64ISD::GLD1S_SXTW_SCALED_MERGE_ZERO:
   case AArch64ISD::GLD1S_IMM_MERGE_ZERO:
     return performGLD1Combine(N, DAG);
   case AArch64ISD::VASHR:
   case AArch64ISD::VLSHR:
     return performVectorShiftCombine(N, *this, DCI);
+  case AArch64ISD::SUNPKLO:
+    return performSunpkloCombine(N, DAG);
   case ISD::INSERT_VECTOR_ELT:
     return performInsertVectorEltCombine(N, DCI);
   case ISD::EXTRACT_VECTOR_ELT:
     return performExtractVectorEltCombine(N, DAG);
   case ISD::VECREDUCE_ADD:
     return performVecReduceAddCombine(N, DCI.DAG, Subtarget);
   case ISD::INTRINSIC_VOID:
   case ISD::INTRINSIC_W_CHAIN:
@@ 2,073 lines not shown @@

llvm/test/CodeGen/AArch64/sve-fixed-length-masked-loads.ll

@@ 262 lines not shown @@
 }
 
 define <32 x i16> @masked_load_sext_v32i8i16(<32 x i8>* %ap, <32 x i8>* %bp) #0 {
 ; VBITS_GE_512-LABEL: masked_load_sext_v32i8i16:
 ; VBITS_GE_512: // %bb.0:
 ; VBITS_GE_512-NEXT: ptrue p0.b, vl32
 ; VBITS_GE_512-NEXT: ld1b { z0.b }, p0/z, [x1]
 ; VBITS_GE_512-NEXT: cmpeq p0.b, p0/z, z0.b, #0
-; VBITS_GE_512-NEXT: mov z0.b, p0/z, #-1 // =0xffffffffffffffff
-; VBITS_GE_512-NEXT: sunpklo z0.h, z0.b
+; VBITS_GE_512-NEXT: punpklo p0.h, p0.b
+; VBITS_GE_512-NEXT: ld1sb { z0.h }, p0/z, [x0]
 ; VBITS_GE_512-NEXT: ptrue p0.h, vl32
-; VBITS_GE_512-NEXT: cmpne p1.h, p0/z, z0.h, #0
-; VBITS_GE_512-NEXT: ld1sb { z0.h }, p1/z, [x0]
 ; VBITS_GE_512-NEXT: st1h { z0.h }, p0, [x8]
 ; VBITS_GE_512-NEXT: ret
   %b = load <32 x i8>, <32 x i8>* %bp
   %mask = icmp eq <32 x i8> %b, zeroinitializer
   %load = call <32 x i8> @llvm.masked.load.v32i8(<32 x i8>* %ap, i32 8, <32 x i1> %mask, <32 x i8> undef)
   %ext = sext <32 x i8> %load to <32 x i16>
   ret <32 x i16> %ext
 }
@@ 38 lines not shown @@
 }
 
 define <16 x i32> @masked_load_sext_v16i16i32(<16 x i16>* %ap, <16 x i16>* %bp) #0 {
 ; VBITS_GE_512-LABEL: masked_load_sext_v16i16i32:
 ; VBITS_GE_512: // %bb.0:
 ; VBITS_GE_512-NEXT: ptrue p0.h, vl16
 ; VBITS_GE_512-NEXT: ld1h { z0.h }, p0/z, [x1]
 ; VBITS_GE_512-NEXT: cmpeq p0.h, p0/z, z0.h, #0
-; VBITS_GE_512-NEXT: mov z0.h, p0/z, #-1 // =0xffffffffffffffff
-; VBITS_GE_512-NEXT: sunpklo z0.s, z0.h
+; VBITS_GE_512-NEXT: punpklo p0.h, p0.b
+; VBITS_GE_512-NEXT: ld1sh { z0.s }, p0/z, [x0]
 ; VBITS_GE_512-NEXT: ptrue p0.s, vl16
-; VBITS_GE_512-NEXT: cmpne p1.s, p0/z, z0.s, #0
-; VBITS_GE_512-NEXT: ld1sh { z0.s }, p1/z, [x0]
 ; VBITS_GE_512-NEXT: st1w { z0.s }, p0, [x8]
 ; VBITS_GE_512-NEXT: ret
   %b = load <16 x i16>, <16 x i16>* %bp
   %mask = icmp eq <16 x i16> %b, zeroinitializer
   %load = call <16 x i16> @llvm.masked.load.v16i16(<16 x i16>* %ap, i32 8, <16 x i1> %mask, <16 x i16> undef)
   %ext = sext <16 x i16> %load to <16 x i32>
   ret <16 x i32> %ext
 }
@@ 18 lines not shown @@
 }
 
 define <8 x i64> @masked_load_sext_v8i32i64(<8 x i32>* %ap, <8 x i32>* %bp) #0 {
 ; VBITS_GE_512-LABEL: masked_load_sext_v8i32i64:
 ; VBITS_GE_512: // %bb.0:
 ; VBITS_GE_512-NEXT: ptrue p0.s, vl8
 ; VBITS_GE_512-NEXT: ld1w { z0.s }, p0/z, [x1]
 ; VBITS_GE_512-NEXT: cmpeq p0.s, p0/z, z0.s, #0
-; VBITS_GE_512-NEXT: mov z0.s, p0/z, #-1 // =0xffffffffffffffff
-; VBITS_GE_512-NEXT: sunpklo z0.d, z0.s
+; VBITS_GE_512-NEXT: punpklo p0.h, p0.b
+; VBITS_GE_512-NEXT: ld1sw { z0.d }, p0/z, [x0]
 ; VBITS_GE_512-NEXT: ptrue p0.d, vl8
-; VBITS_GE_512-NEXT: cmpne p1.d, p0/z, z0.d, #0
-; VBITS_GE_512-NEXT: ld1sw { z0.d }, p1/z, [x0]
 ; VBITS_GE_512-NEXT: st1d { z0.d }, p0, [x8]
 ; VBITS_GE_512-NEXT: ret
   %b = load <8 x i32>, <8 x i32>* %bp
   %mask = icmp eq <8 x i32> %b, zeroinitializer
   %load = call <8 x i32> @llvm.masked.load.v8i32(<8 x i32>* %ap, i32 8, <8 x i1> %mask, <8 x i32> undef)
   %ext = sext <8 x i32> %load to <8 x i64>
   ret <8 x i64> %ext
 }

	define <32 x i16> @masked_load_zext_v32i8i16(<32 x i8>* %ap, <32 x i8>* %bp) #0 {			define <32 x i16> @masked_load_zext_v32i8i16(<32 x i8>* %ap, <32 x i8>* %bp) #0 {
	; VBITS_GE_512-LABEL: masked_load_zext_v32i8i16:			; VBITS_GE_512-LABEL: masked_load_zext_v32i8i16:
	; VBITS_GE_512: // %bb.0:			; VBITS_GE_512: // %bb.0:
	; VBITS_GE_512-NEXT: ptrue p0.b, vl32			; VBITS_GE_512-NEXT: ptrue p0.b, vl32
	; VBITS_GE_512-NEXT: ld1b { z0.b }, p0/z, [x1]			; VBITS_GE_512-NEXT: ld1b { z0.b }, p0/z, [x1]
	; VBITS_GE_512-NEXT: cmpeq p0.b, p0/z, z0.b, #0			; VBITS_GE_512-NEXT: cmpeq p0.b, p0/z, z0.b, #0
	; VBITS_GE_512-NEXT: mov z0.b, p0/z, #-1 // =0xffffffffffffffff			; VBITS_GE_512-NEXT: punpklo p0.h, p0.b
	; VBITS_GE_512-NEXT: sunpklo z0.h, z0.b			; VBITS_GE_512-NEXT: ld1b { z0.h }, p0/z, [x0]
	; VBITS_GE_512-NEXT: ptrue p0.h, vl32			; VBITS_GE_512-NEXT: ptrue p0.h, vl32
	; VBITS_GE_512-NEXT: cmpne p1.h, p0/z, z0.h, #0
	; VBITS_GE_512-NEXT: ld1b { z0.h }, p1/z, [x0]
	; VBITS_GE_512-NEXT: st1h { z0.h }, p0, [x8]			; VBITS_GE_512-NEXT: st1h { z0.h }, p0, [x8]
	; VBITS_GE_512-NEXT: ret			; VBITS_GE_512-NEXT: ret
	%b = load <32 x i8>, <32 x i8>* %bp			%b = load <32 x i8>, <32 x i8>* %bp
	%mask = icmp eq <32 x i8> %b, zeroinitializer			%mask = icmp eq <32 x i8> %b, zeroinitializer
	%load = call <32 x i8> @llvm.masked.load.v32i8(<32 x i8>* %ap, i32 8, <32 x i1> %mask, <32 x i8> undef)			%load = call <32 x i8> @llvm.masked.load.v32i8(<32 x i8>* %ap, i32 8, <32 x i1> %mask, <32 x i8> undef)
	%ext = zext <32 x i8> %load to <32 x i16>			%ext = zext <32 x i8> %load to <32 x i16>
	ret <32 x i16> %ext			ret <32 x i16> %ext
	}			}
	[38 lines not shown]
	}			}

	define <16 x i32> @masked_load_zext_v16i16i32(<16 x i16>* %ap, <16 x i16>* %bp) #0 {			define <16 x i32> @masked_load_zext_v16i16i32(<16 x i16>* %ap, <16 x i16>* %bp) #0 {
	; VBITS_GE_512-LABEL: masked_load_zext_v16i16i32:			; VBITS_GE_512-LABEL: masked_load_zext_v16i16i32:
	; VBITS_GE_512: // %bb.0:			; VBITS_GE_512: // %bb.0:
	; VBITS_GE_512-NEXT: ptrue p0.h, vl16			; VBITS_GE_512-NEXT: ptrue p0.h, vl16
	; VBITS_GE_512-NEXT: ld1h { z0.h }, p0/z, [x1]			; VBITS_GE_512-NEXT: ld1h { z0.h }, p0/z, [x1]
	; VBITS_GE_512-NEXT: cmpeq p0.h, p0/z, z0.h, #0			; VBITS_GE_512-NEXT: cmpeq p0.h, p0/z, z0.h, #0
	; VBITS_GE_512-NEXT: mov z0.h, p0/z, #-1 // =0xffffffffffffffff			; VBITS_GE_512-NEXT: punpklo p0.h, p0.b
	; VBITS_GE_512-NEXT: sunpklo z0.s, z0.h			; VBITS_GE_512-NEXT: ld1h { z0.s }, p0/z, [x0]
	; VBITS_GE_512-NEXT: ptrue p0.s, vl16			; VBITS_GE_512-NEXT: ptrue p0.s, vl16
	; VBITS_GE_512-NEXT: cmpne p1.s, p0/z, z0.s, #0
	; VBITS_GE_512-NEXT: ld1h { z0.s }, p1/z, [x0]
	; VBITS_GE_512-NEXT: st1w { z0.s }, p0, [x8]			; VBITS_GE_512-NEXT: st1w { z0.s }, p0, [x8]
	; VBITS_GE_512-NEXT: ret			; VBITS_GE_512-NEXT: ret
	%b = load <16 x i16>, <16 x i16>* %bp			%b = load <16 x i16>, <16 x i16>* %bp
	%mask = icmp eq <16 x i16> %b, zeroinitializer			%mask = icmp eq <16 x i16> %b, zeroinitializer
	%load = call <16 x i16> @llvm.masked.load.v16i16(<16 x i16>* %ap, i32 8, <16 x i1> %mask, <16 x i16> undef)			%load = call <16 x i16> @llvm.masked.load.v16i16(<16 x i16>* %ap, i32 8, <16 x i1> %mask, <16 x i16> undef)
	%ext = zext <16 x i16> %load to <16 x i32>			%ext = zext <16 x i16> %load to <16 x i32>
	ret <16 x i32> %ext			ret <16 x i32> %ext
	}			}
	[18 lines not shown]
	}			}

	define <8 x i64> @masked_load_zext_v8i32i64(<8 x i32>* %ap, <8 x i32>* %bp) #0 {			define <8 x i64> @masked_load_zext_v8i32i64(<8 x i32>* %ap, <8 x i32>* %bp) #0 {
	; VBITS_GE_512-LABEL: masked_load_zext_v8i32i64:			; VBITS_GE_512-LABEL: masked_load_zext_v8i32i64:
	; VBITS_GE_512: // %bb.0:			; VBITS_GE_512: // %bb.0:
	; VBITS_GE_512-NEXT: ptrue p0.s, vl8			; VBITS_GE_512-NEXT: ptrue p0.s, vl8
	; VBITS_GE_512-NEXT: ld1w { z0.s }, p0/z, [x1]			; VBITS_GE_512-NEXT: ld1w { z0.s }, p0/z, [x1]
	; VBITS_GE_512-NEXT: cmpeq p0.s, p0/z, z0.s, #0			; VBITS_GE_512-NEXT: cmpeq p0.s, p0/z, z0.s, #0
	; VBITS_GE_512-NEXT: mov z0.s, p0/z, #-1 // =0xffffffffffffffff			; VBITS_GE_512-NEXT: punpklo p0.h, p0.b
	; VBITS_GE_512-NEXT: sunpklo z0.d, z0.s			; VBITS_GE_512-NEXT: ld1w { z0.d }, p0/z, [x0]
	; VBITS_GE_512-NEXT: ptrue p0.d, vl8			; VBITS_GE_512-NEXT: ptrue p0.d, vl8
	; VBITS_GE_512-NEXT: cmpne p1.d, p0/z, z0.d, #0
	; VBITS_GE_512-NEXT: ld1w { z0.d }, p1/z, [x0]
	; VBITS_GE_512-NEXT: st1d { z0.d }, p0, [x8]			; VBITS_GE_512-NEXT: st1d { z0.d }, p0, [x8]
	; VBITS_GE_512-NEXT: ret			; VBITS_GE_512-NEXT: ret
	%b = load <8 x i32>, <8 x i32>* %bp			%b = load <8 x i32>, <8 x i32>* %bp
	%mask = icmp eq <8 x i32> %b, zeroinitializer			%mask = icmp eq <8 x i32> %b, zeroinitializer
	%load = call <8 x i32> @llvm.masked.load.v8i32(<8 x i32>* %ap, i32 8, <8 x i1> %mask, <8 x i32> undef)			%load = call <8 x i32> @llvm.masked.load.v8i32(<8 x i32>* %ap, i32 8, <8 x i1> %mask, <8 x i32> undef)
	%ext = zext <8 x i32> %load to <8 x i64>			%ext = zext <8 x i32> %load to <8 x i64>
	ret <8 x i64> %ext			ret <8 x i64> %ext
	}			}
	[185 lines not shown]
	; VBITS_GE_512-NEXT: ret			; VBITS_GE_512-NEXT: ret
	%b = load <8 x i64>, <8 x i64>* %bp			%b = load <8 x i64>, <8 x i64>* %bp
	%mask = icmp eq <8 x i64> %b, zeroinitializer			%mask = icmp eq <8 x i64> %b, zeroinitializer
	%load = call <8 x i32> @llvm.masked.load.v8i32(<8 x i32>* %ap, i32 8, <8 x i1> %mask, <8 x i32> undef)			%load = call <8 x i32> @llvm.masked.load.v8i32(<8 x i32>* %ap, i32 8, <8 x i1> %mask, <8 x i32> undef)
	%ext = zext <8 x i32> %load to <8 x i64>			%ext = zext <8 x i32> %load to <8 x i64>
	ret <8 x i64> %ext			ret <8 x i64> %ext
	}			}

				define <128 x i16> @masked_load_sext_v128i8i16(<128 x i8>* %ap, <128 x i8>* %bp) #0 {
				; VBITS_GE_2048-LABEL: masked_load_sext_v128i8i16:
				; VBITS_GE_2048: // %bb.0:
				; VBITS_GE_2048-NEXT: ptrue p0.b, vl128
				; VBITS_GE_2048-NEXT: ld1b { z0.b }, p0/z, [x1]
				; VBITS_GE_2048-NEXT: cmpeq p0.b, p0/z, z0.b, #0
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: ld1sb { z0.h }, p0/z, [x0]
				; VBITS_GE_2048-NEXT: ptrue p0.h, vl128
				; VBITS_GE_2048-NEXT: st1h { z0.h }, p0, [x8]
				; VBITS_GE_2048-NEXT: ret
				%b = load <128 x i8>, <128 x i8>* %bp
				%mask = icmp eq <128 x i8> %b, zeroinitializer
				%load = call <128 x i8> @llvm.masked.load.v128i8(<128 x i8>* %ap, i32 8, <128 x i1> %mask, <128 x i8> undef)
				%ext = sext <128 x i8> %load to <128 x i16>
				ret <128 x i16> %ext
				}

				define <64 x i32> @masked_load_sext_v64i8i32(<64 x i8>* %ap, <64 x i8>* %bp) #0 {
				; VBITS_GE_2048-LABEL: masked_load_sext_v64i8i32:
				; VBITS_GE_2048: // %bb.0:
				; VBITS_GE_2048-NEXT: ptrue p0.b, vl64
				; VBITS_GE_2048-NEXT: ld1b { z0.b }, p0/z, [x1]
				; VBITS_GE_2048-NEXT: cmpeq p0.b, p0/z, z0.b, #0
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: ld1sb { z0.s }, p0/z, [x0]
				; VBITS_GE_2048-NEXT: ptrue p0.s, vl64
				; VBITS_GE_2048-NEXT: st1w { z0.s }, p0, [x8]
				; VBITS_GE_2048-NEXT: ret
				%b = load <64 x i8>, <64 x i8>* %bp
				%mask = icmp eq <64 x i8> %b, zeroinitializer
				%load = call <64 x i8> @llvm.masked.load.v64i8(<64 x i8>* %ap, i32 8, <64 x i1> %mask, <64 x i8> undef)
				%ext = sext <64 x i8> %load to <64 x i32>
				ret <64 x i32> %ext
				}

				define <32 x i64> @masked_load_sext_v32i8i64(<32 x i8>* %ap, <32 x i8>* %bp) #0 {
				; VBITS_GE_2048-LABEL: masked_load_sext_v32i8i64:
				; VBITS_GE_2048: // %bb.0:
				; VBITS_GE_2048-NEXT: ptrue p0.b, vl32
				; VBITS_GE_2048-NEXT: ld1b { z0.b }, p0/z, [x1]
				; VBITS_GE_2048-NEXT: cmpeq p0.b, p0/z, z0.b, #0
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: ld1sb { z0.d }, p0/z, [x0]
				; VBITS_GE_2048-NEXT: ptrue p0.d, vl32
				; VBITS_GE_2048-NEXT: st1d { z0.d }, p0, [x8]
				; VBITS_GE_2048-NEXT: ret
				%b = load <32 x i8>, <32 x i8>* %bp
				%mask = icmp eq <32 x i8> %b, zeroinitializer
				%load = call <32 x i8> @llvm.masked.load.v32i8(<32 x i8>* %ap, i32 8, <32 x i1> %mask, <32 x i8> undef)
				%ext = sext <32 x i8> %load to <32 x i64>
				ret <32 x i64> %ext
				}

				define <64 x i32> @masked_load_sext_v64i16i32(<64 x i16>* %ap, <64 x i16>* %bp) #0 {
				; VBITS_GE_2048-LABEL: masked_load_sext_v64i16i32:
				; VBITS_GE_2048: // %bb.0:
				; VBITS_GE_2048-NEXT: ptrue p0.h, vl64
				; VBITS_GE_2048-NEXT: ld1h { z0.h }, p0/z, [x1]
				; VBITS_GE_2048-NEXT: cmpeq p0.h, p0/z, z0.h, #0
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: ld1sh { z0.s }, p0/z, [x0]
				; VBITS_GE_2048-NEXT: ptrue p0.s, vl64
				; VBITS_GE_2048-NEXT: st1w { z0.s }, p0, [x8]
				; VBITS_GE_2048-NEXT: ret
				%b = load <64 x i16>, <64 x i16>* %bp
				%mask = icmp eq <64 x i16> %b, zeroinitializer
				%load = call <64 x i16> @llvm.masked.load.v64i16(<64 x i16>* %ap, i32 8, <64 x i1> %mask, <64 x i16> undef)
				%ext = sext <64 x i16> %load to <64 x i32>
				ret <64 x i32> %ext
				}

				define <32 x i64> @masked_load_sext_v32i16i64(<32 x i16>* %ap, <32 x i16>* %bp) #0 {
				; VBITS_GE_2048-LABEL: masked_load_sext_v32i16i64:
				; VBITS_GE_2048: // %bb.0:
				; VBITS_GE_2048-NEXT: ptrue p0.h, vl32
				; VBITS_GE_2048-NEXT: ld1h { z0.h }, p0/z, [x1]
				; VBITS_GE_2048-NEXT: cmpeq p0.h, p0/z, z0.h, #0
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: ld1sh { z0.d }, p0/z, [x0]
				; VBITS_GE_2048-NEXT: ptrue p0.d, vl32
				; VBITS_GE_2048-NEXT: st1d { z0.d }, p0, [x8]
				; VBITS_GE_2048-NEXT: ret
				%b = load <32 x i16>, <32 x i16>* %bp
				%mask = icmp eq <32 x i16> %b, zeroinitializer
				%load = call <32 x i16> @llvm.masked.load.v32i16(<32 x i16>* %ap, i32 8, <32 x i1> %mask, <32 x i16> undef)
				%ext = sext <32 x i16> %load to <32 x i64>
				ret <32 x i64> %ext
				}

				define <32 x i64> @masked_load_sext_v32i32i64(<32 x i32>* %ap, <32 x i32>* %bp) #0 {
				; VBITS_GE_2048-LABEL: masked_load_sext_v32i32i64:
				; VBITS_GE_2048: // %bb.0:
				; VBITS_GE_2048-NEXT: ptrue p0.s, vl32
				; VBITS_GE_2048-NEXT: ld1w { z0.s }, p0/z, [x1]
				; VBITS_GE_2048-NEXT: cmpeq p0.s, p0/z, z0.s, #0
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: ld1sw { z0.d }, p0/z, [x0]
				; VBITS_GE_2048-NEXT: ptrue p0.d, vl32
				; VBITS_GE_2048-NEXT: st1d { z0.d }, p0, [x8]
				; VBITS_GE_2048-NEXT: ret
				%b = load <32 x i32>, <32 x i32>* %bp
				%mask = icmp eq <32 x i32> %b, zeroinitializer
				%load = call <32 x i32> @llvm.masked.load.v32i32(<32 x i32>* %ap, i32 8, <32 x i1> %mask, <32 x i32> undef)
				%ext = sext <32 x i32> %load to <32 x i64>
				ret <32 x i64> %ext
				}

				define <128 x i16> @masked_load_zext_v128i8i16(<128 x i8>* %ap, <128 x i8>* %bp) #0 {
				; VBITS_GE_2048-LABEL: masked_load_zext_v128i8i16:
				; VBITS_GE_2048: // %bb.0:
				; VBITS_GE_2048-NEXT: ptrue p0.b, vl128
				; VBITS_GE_2048-NEXT: ld1b { z0.b }, p0/z, [x1]
				; VBITS_GE_2048-NEXT: cmpeq p0.b, p0/z, z0.b, #0
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: ld1b { z0.h }, p0/z, [x0]
				; VBITS_GE_2048-NEXT: ptrue p0.h, vl128
				; VBITS_GE_2048-NEXT: st1h { z0.h }, p0, [x8]
				; VBITS_GE_2048-NEXT: ret
				%b = load <128 x i8>, <128 x i8>* %bp
				%mask = icmp eq <128 x i8> %b, zeroinitializer
				%load = call <128 x i8> @llvm.masked.load.v128i8(<128 x i8>* %ap, i32 8, <128 x i1> %mask, <128 x i8> undef)
				%ext = zext <128 x i8> %load to <128 x i16>
				ret <128 x i16> %ext
				}

				define <64 x i32> @masked_load_zext_v64i8i32(<64 x i8>* %ap, <64 x i8>* %bp) #0 {
				; VBITS_GE_2048-LABEL: masked_load_zext_v64i8i32:
				; VBITS_GE_2048: // %bb.0:
				; VBITS_GE_2048-NEXT: ptrue p0.b, vl64
				; VBITS_GE_2048-NEXT: ld1b { z0.b }, p0/z, [x1]
				; VBITS_GE_2048-NEXT: cmpeq p0.b, p0/z, z0.b, #0
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: ld1b { z0.s }, p0/z, [x0]
				; VBITS_GE_2048-NEXT: ptrue p0.s, vl64
				; VBITS_GE_2048-NEXT: st1w { z0.s }, p0, [x8]
				; VBITS_GE_2048-NEXT: ret
				%b = load <64 x i8>, <64 x i8>* %bp
				%mask = icmp eq <64 x i8> %b, zeroinitializer
				%load = call <64 x i8> @llvm.masked.load.v64i8(<64 x i8>* %ap, i32 8, <64 x i1> %mask, <64 x i8> undef)
				%ext = zext <64 x i8> %load to <64 x i32>
				ret <64 x i32> %ext
				}

				define <32 x i64> @masked_load_zext_v32i8i64(<32 x i8>* %ap, <32 x i8>* %bp) #0 {
				; VBITS_GE_2048-LABEL: masked_load_zext_v32i8i64:
				; VBITS_GE_2048: // %bb.0:
				; VBITS_GE_2048-NEXT: ptrue p0.b, vl32
				; VBITS_GE_2048-NEXT: ld1b { z0.b }, p0/z, [x1]
				; VBITS_GE_2048-NEXT: cmpeq p0.b, p0/z, z0.b, #0
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: ld1b { z0.d }, p0/z, [x0]
				; VBITS_GE_2048-NEXT: ptrue p0.d, vl32
				; VBITS_GE_2048-NEXT: st1d { z0.d }, p0, [x8]
				; VBITS_GE_2048-NEXT: ret
				%b = load <32 x i8>, <32 x i8>* %bp
				%mask = icmp eq <32 x i8> %b, zeroinitializer
				%load = call <32 x i8> @llvm.masked.load.v32i8(<32 x i8>* %ap, i32 8, <32 x i1> %mask, <32 x i8> undef)
				%ext = zext <32 x i8> %load to <32 x i64>
				ret <32 x i64> %ext
				}

				define <64 x i32> @masked_load_zext_v64i16i32(<64 x i16>* %ap, <64 x i16>* %bp) #0 {
				; VBITS_GE_2048-LABEL: masked_load_zext_v64i16i32:
				; VBITS_GE_2048: // %bb.0:
				; VBITS_GE_2048-NEXT: ptrue p0.h, vl64
				; VBITS_GE_2048-NEXT: ld1h { z0.h }, p0/z, [x1]
				; VBITS_GE_2048-NEXT: cmpeq p0.h, p0/z, z0.h, #0
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: ld1h { z0.s }, p0/z, [x0]
				; VBITS_GE_2048-NEXT: ptrue p0.s, vl64
				; VBITS_GE_2048-NEXT: st1w { z0.s }, p0, [x8]
				; VBITS_GE_2048-NEXT: ret
				%b = load <64 x i16>, <64 x i16>* %bp
				%mask = icmp eq <64 x i16> %b, zeroinitializer
				%load = call <64 x i16> @llvm.masked.load.v64i16(<64 x i16>* %ap, i32 8, <64 x i1> %mask, <64 x i16> undef)
				%ext = zext <64 x i16> %load to <64 x i32>
				ret <64 x i32> %ext
				}

				define <32 x i64> @masked_load_zext_v32i16i64(<32 x i16>* %ap, <32 x i16>* %bp) #0 {
				; VBITS_GE_2048-LABEL: masked_load_zext_v32i16i64:
				; VBITS_GE_2048: // %bb.0:
				; VBITS_GE_2048-NEXT: ptrue p0.h, vl32
				; VBITS_GE_2048-NEXT: ld1h { z0.h }, p0/z, [x1]
				; VBITS_GE_2048-NEXT: cmpeq p0.h, p0/z, z0.h, #0
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: ld1h { z0.d }, p0/z, [x0]
				; VBITS_GE_2048-NEXT: ptrue p0.d, vl32
				; VBITS_GE_2048-NEXT: st1d { z0.d }, p0, [x8]
				; VBITS_GE_2048-NEXT: ret
				%b = load <32 x i16>, <32 x i16>* %bp
				%mask = icmp eq <32 x i16> %b, zeroinitializer
				%load = call <32 x i16> @llvm.masked.load.v32i16(<32 x i16>* %ap, i32 8, <32 x i1> %mask, <32 x i16> undef)
				%ext = zext <32 x i16> %load to <32 x i64>
				ret <32 x i64> %ext
				}

				define <32 x i64> @masked_load_zext_v32i32i64(<32 x i32>* %ap, <32 x i32>* %bp) #0 {
				; VBITS_GE_2048-LABEL: masked_load_zext_v32i32i64:
				; VBITS_GE_2048: // %bb.0:
				; VBITS_GE_2048-NEXT: ptrue p0.s, vl32
				; VBITS_GE_2048-NEXT: ld1w { z0.s }, p0/z, [x1]
				; VBITS_GE_2048-NEXT: cmpeq p0.s, p0/z, z0.s, #0
				; VBITS_GE_2048-NEXT: punpklo p0.h, p0.b
				; VBITS_GE_2048-NEXT: ld1w { z0.d }, p0/z, [x0]
				; VBITS_GE_2048-NEXT: ptrue p0.d, vl32
				; VBITS_GE_2048-NEXT: st1d { z0.d }, p0, [x8]
				; VBITS_GE_2048-NEXT: ret
				%b = load <32 x i32>, <32 x i32>* %bp
				%mask = icmp eq <32 x i32> %b, zeroinitializer
				%load = call <32 x i32> @llvm.masked.load.v32i32(<32 x i32>* %ap, i32 8, <32 x i1> %mask, <32 x i32> undef)
				%ext = zext <32 x i32> %load to <32 x i64>
				ret <32 x i64> %ext
				}

	declare <2 x half> @llvm.masked.load.v2f16(<2 x half>*, i32, <2 x i1>, <2 x half>)			declare <2 x half> @llvm.masked.load.v2f16(<2 x half>*, i32, <2 x i1>, <2 x half>)
	declare <2 x float> @llvm.masked.load.v2f32(<2 x float>*, i32, <2 x i1>, <2 x float>)			declare <2 x float> @llvm.masked.load.v2f32(<2 x float>*, i32, <2 x i1>, <2 x float>)
	declare <4 x float> @llvm.masked.load.v4f32(<4 x float>*, i32, <4 x i1>, <4 x float>)			declare <4 x float> @llvm.masked.load.v4f32(<4 x float>*, i32, <4 x i1>, <4 x float>)
	declare <8 x float> @llvm.masked.load.v8f32(<8 x float>*, i32, <8 x i1>, <8 x float>)			declare <8 x float> @llvm.masked.load.v8f32(<8 x float>*, i32, <8 x i1>, <8 x float>)
	declare <16 x float> @llvm.masked.load.v16f32(<16 x float>*, i32, <16 x i1>, <16 x float>)			declare <16 x float> @llvm.masked.load.v16f32(<16 x float>*, i32, <16 x i1>, <16 x float>)
	declare <32 x float> @llvm.masked.load.v32f32(<32 x float>*, i32, <32 x i1>, <32 x float>)			declare <32 x float> @llvm.masked.load.v32f32(<32 x float>*, i32, <32 x i1>, <32 x float>)
	declare <64 x float> @llvm.masked.load.v64f32(<64 x float>*, i32, <64 x i1>, <64 x float>)			declare <64 x float> @llvm.masked.load.v64f32(<64 x float>*, i32, <64 x i1>, <64 x float>)

				declare <128 x i8> @llvm.masked.load.v128i8(<128 x i8>*, i32, <128 x i1>, <128 x i8>)
	declare <64 x i8> @llvm.masked.load.v64i8(<64 x i8>*, i32, <64 x i1>, <64 x i8>)			declare <64 x i8> @llvm.masked.load.v64i8(<64 x i8>*, i32, <64 x i1>, <64 x i8>)
	declare <32 x i8> @llvm.masked.load.v32i8(<32 x i8>*, i32, <32 x i1>, <32 x i8>)			declare <32 x i8> @llvm.masked.load.v32i8(<32 x i8>*, i32, <32 x i1>, <32 x i8>)
	declare <16 x i8> @llvm.masked.load.v16i8(<16 x i8>*, i32, <16 x i1>, <16 x i8>)			declare <16 x i8> @llvm.masked.load.v16i8(<16 x i8>*, i32, <16 x i1>, <16 x i8>)
	declare <16 x i16> @llvm.masked.load.v16i16(<16 x i16>*, i32, <16 x i1>, <16 x i16>)			declare <16 x i16> @llvm.masked.load.v16i16(<16 x i16>*, i32, <16 x i1>, <16 x i16>)
	declare <8 x i8> @llvm.masked.load.v8i8(<8 x i8>*, i32, <8 x i1>, <8 x i8>)			declare <8 x i8> @llvm.masked.load.v8i8(<8 x i8>*, i32, <8 x i1>, <8 x i8>)
	declare <8 x i16> @llvm.masked.load.v8i16(<8 x i16>*, i32, <8 x i1>, <8 x i16>)			declare <8 x i16> @llvm.masked.load.v8i16(<8 x i16>*, i32, <8 x i1>, <8 x i16>)
	declare <8 x i32> @llvm.masked.load.v8i32(<8 x i32>*, i32, <8 x i1>, <8 x i32>)			declare <8 x i32> @llvm.masked.load.v8i32(<8 x i32>*, i32, <8 x i1>, <8 x i32>)
				declare <32 x i32> @llvm.masked.load.v32i32(<32 x i32>*, i32, <32 x i1>, <32 x i32>)
	declare <32 x i16> @llvm.masked.load.v32i16(<32 x i16>*, i32, <32 x i1>, <32 x i16>)			declare <32 x i16> @llvm.masked.load.v32i16(<32 x i16>*, i32, <32 x i1>, <32 x i16>)
				declare <64 x i16> @llvm.masked.load.v64i16(<64 x i16>*, i32, <64 x i1>, <64 x i16>)
	declare <16 x i32> @llvm.masked.load.v16i32(<16 x i32>*, i32, <16 x i1>, <16 x i32>)			declare <16 x i32> @llvm.masked.load.v16i32(<16 x i32>*, i32, <16 x i1>, <16 x i32>)
	declare <8 x i64> @llvm.masked.load.v8i64(<8 x i64>*, i32, <8 x i1>, <8 x i64>)			declare <8 x i64> @llvm.masked.load.v8i64(<8 x i64>*, i32, <8 x i1>, <8 x i64>)
	declare <8 x double> @llvm.masked.load.v8f64(<8 x double>*, i32, <8 x i1>, <8 x double>)			declare <8 x double> @llvm.masked.load.v8f64(<8 x double>*, i32, <8 x i1>, <8 x double>)

	attributes #0 = { "target-features"="+sve" }			attributes #0 = { "target-features"="+sve" }
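
The updated CHECK lines above use one `punpklo p0.h, p0.b` per doubling of the element width (one for i8 to i16, two for i8 to i32, three for i8 to i64). The following is a minimal sketch of why that works, assuming the usual SVE predicate layout of one predicate bit per vector byte with the lowest bit of each element being the significant one. It is a toy model for illustration, not code from the patch; `Pred`, `makePred`, `punpklo` and `widenMask` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: one predicate bit per vector byte.
using Pred = std::vector<bool>;

// Build a predicate in which element e (elemBytes wide) is active when
// active[e] is true. Only the lowest bit of each element is set.
Pred makePred(std::size_t vlBytes, std::size_t elemBytes,
              const std::vector<bool> &active) {
  Pred p(vlBytes, false);
  for (std::size_t e = 0; e < active.size() && e * elemBytes < vlBytes; ++e)
    p[e * elemBytes] = active[e];
  return p;
}

// Model of PUNPKLO Pd.H, Pn.B: each bit in the low half of the source moves
// to twice its index; the odd positions of the result are zero.
Pred punpklo(const Pred &p) {
  assert(p.size() % 2 == 0 && "predicate length must be even");
  Pred r(p.size(), false);
  for (std::size_t i = 0; i < p.size() / 2; ++i)
    r[2 * i] = p[i];
  return r;
}

// Widen a mask from srcBytes-wide elements to dstBytes-wide elements by
// unpacking once per doubling of the element width.
Pred widenMask(Pred p, std::size_t srcBytes, std::size_t dstBytes) {
  for (std::size_t w = srcBytes; w < dstBytes; w *= 2)
    p = punpklo(p);
  return p;
}
```

In this model, a mask produced by a `.s` compare over 8 elements of a 64-byte vector becomes, after one unpack, the same mask expressed over `.d` elements, which is the form the extending load in `masked_load_sext_v8i32i64` consumes. Only elements in the low half of the source predicate survive each unpack, which matches the tests, where the narrow input always occupies at most the low part of the register.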