This is an archive of the discontinued LLVM Phabricator instance.

Differential D85364

[SVE][WIP] Implement lowering for fixed width select
ClosedPublic

Authored by cameron.mcinally on Aug 5 2020, 2:57 PM.

Details

Reviewers

paulwalker-arm
sdesmalen
eli.friedman
david-arm
Asif
dancgr
kmclaughlin
rengolin
efriedma
t.p.northover

Commits

rGa35c7f30769b: [SVE][WIP] Implement lowering for fixed length VSELECT to Scalable

Summary

Posting this patch seeking guidance...

This patch is lowering code for fixed width select. There's an issue with the select's condition operand though. Fixed width vectors-of-i1s (e.g. v8i1) aren't legal. They will be promoted to a larger legal integer vector (e.g. v8i64). We run into problems when we lower the fixed width masks to their legal scalable counterparts (e.g. nxv2i1).

Here's a hard example:

In AArch64TargetLowering::useSVEForFixedLengthVectorVT(...), we have this comment:

// Fixed length predicates should be promoted to i8.
// NOTE: This is consistent with how NEON (and thus 64/128bit vectors) work.

That's problematic for a select like this:

select <8 x i1> %mask, <8 x i64> %op1, <8 x i64> %op2

At VL=512, the v8i1 mask will be promoted to v8i64. In order to lower this to a scalable mask, we'd need to insert the v8i64 subvector into a nxv2i64. And then truncate that ZPR by performing a CMPNE against 0, to get the final nxv2i1 mask. Between the zero extend to promote the vXi1 mask, and the truncate to get back to a nxvXi1, there's a lot of extra instructions.

What's the best way to proceed with this? Are these extra instructions something we live with? Or should we play with making the vXi1 mask types legal when we're lowering fixed width vectors? I don't have a good feel for how the latter will fit with the existing NEON mask though.
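The promote-then-truncate round trip described above can be modelled in scalar code. This is only a sketch under stated assumptions: `promoteMask` and `truncateToPredicate` are hypothetical helper names (not LLVM APIs), the 8-lane width matches the v8i1/v8i64 case at VL=512, and plain arrays stand in for the ZPR and predicate registers.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar model; the names below are illustrative, not LLVM APIs.
constexpr unsigned NumLanes = 8; // v8i1 / v8i64 at VL=512

// Promote a packed v8i1 mask (bit i = lane i) to eight 64-bit lanes.
// Each lane becomes all-ones or all-zeros, as a sign extension of one bit would.
std::array<int64_t, NumLanes> promoteMask(uint8_t Packed) {
  std::array<int64_t, NumLanes> Lanes{};
  for (unsigned I = 0; I < NumLanes; ++I)
    Lanes[I] = ((Packed >> I) & 1) ? -1 : 0;
  return Lanes;
}

// Truncate the promoted lanes back to one bit per lane, like a compare
// not-equal against zero.
uint8_t truncateToPredicate(const std::array<int64_t, NumLanes> &Lanes) {
  uint8_t Pred = 0;
  for (unsigned I = 0; I < NumLanes; ++I) {
    assert((Lanes[I] == 0 || Lanes[I] == -1) &&
           "promoted lanes are all-ones or all-zeros");
    if (Lanes[I] != 0)
      Pred |= static_cast<uint8_t>(1u << I);
  }
  return Pred;
}
```

Calling `truncateToPredicate(promoteMask(M))` returns `M` for any 8-bit mask, which is why the extend/compare pair amounts to an expensive no-op whenever it cannot be folded away.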

P.S. You'll notice the included tests do not have CHECK lines yet. I didn't want to do that work until there was a clear direction forward. I will post a small example and the generated assembly for the reviewer's convenience shortly.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

cameron.mcinally created this revision. Aug 5 2020, 2:57 PM

Herald added a reviewer: rengolin. Aug 5 2020, 2:57 PM

Herald added a reviewer: efriedma.

Herald added a project: Restricted Project.

Herald added subscribers: llvm-commits, psnobl, hiraditya and 2 others.

cameron.mcinally requested review of this revision. Aug 5 2020, 2:57 PM

Harbormaster completed remote builds in B67200: Diff 283395. Aug 5 2020, 2:58 PM

For this IR test and -aarch64-sve-vector-bits-min=512:

target triple = "aarch64-unknown-linux-gnu"

define void @select(<8 x double>* %a, <8 x double>* %b, <8 x i1>* %c) #0 {
; CHECK: select:
  %mask = load <8 x i1>, <8 x i1>* %c
  %op1 = load <8 x double>, <8 x double>* %a
  %op2 = load <8 x double>, <8 x double>* %b
  %sel = select <8 x i1> %mask, <8 x double> %op1, <8 x double> %op2
  store <8 x double> %sel, <8 x double>* %a, align 4 
  ret void
}       

attributes #0 = { "target-features"="+sve" }

this patch will generate:

	ldrb	w8, [x2]
	ptrue	p0.d, vl8
	mov	x9, sp
	ptrue	p1.d
	lsr	w10, w8, #7
	lsr	w11, w8, #6
	lsr	w12, w8, #5
	lsr	w13, w8, #4

        // Extend the mask: v8i1 -> v8i64
	sbfx	x10, x10, #0, #1
	sbfx	x11, x11, #0, #1
	stp	x11, x10, [sp, #48]
	sbfx	x11, x12, #0, #1
	sbfx	x12, x13, #0, #1
	lsr	w10, w8, #3
	stp	x12, x11, [sp, #32]
	lsr	w11, w8, #2
	sbfx	x10, x10, #0, #1
	sbfx	x11, x11, #0, #1
	stp	x11, x10, [sp, #16]
	sbfx	x10, x8, #0, #1
	lsr	w8, w8, #1
	sbfx	x8, x8, #0, #1
	stp	x10, x8, [sp]

        // Load extended mask into ZPR: nxv2i64
	ld1d	{ z0.d }, p0/z, [x9]
	ld1d	{ z1.d }, p0/z, [x0]
	ld1d	{ z2.d }, p0/z, [x1]

        // Truncate the mask: nxv2i64 -> nxv2i1
	and	z0.d, z0.d, #0x1
	cmpne	p2.d, p1/z, z0.d, #0

        // Combine the select mask with the fixed VL mask
        // Note: Not sure if this is *really* needed, or if we can trust the select mask,
        //       but that's a separate issue.
	and	p1.b, p1/z, p0.b, p2.b

	sel	z0.d, p1, z1.d, z2.d
	st1d	{ z0.d }, p0, [x0]

At VL=512, the v8i1 mask will be promoted to v8i64. In order to lower this to a scalable mask, we'd need to insert the v8i64 subvector into a nxv2i64. And then truncate that ZPR by performing a CMPNE against 0, to get the final nxv2i1 mask. Between the zero extend to promote the vXi1 mask, and the truncate to get back to a nxvXi1, there's a lot of extra instructions.

I had this concern when I was reviewing the code in question. @paulwalker-arm said he found the conversions were usually folded away in his prototype. Most i1 vectors will be produced by a compare that returns an nxv2i1 or something like that.

For load <8 x i1> specifically, the code is terrible because we're using the generic target-independent expansion, which goes element by element. If we cared, we could custom-lower it to something more reasonable. Nobody has looked into it because there isn't any way to generate that operation from C code.

In D85364#2197931, @efriedma wrote:

For load <8 x i1> specifically, the code is terrible because we're using the generic target-independent expansion, which goes element by element. If we cared, we could custom-lower it to something more reasonable. Nobody has looked into it because there isn't any way to generate that operation from C code.

That's interesting. If we could load the vXi1 and then vector extend it, it might be more palatable. I haven't checked if there are instructions to support that though. And I wonder if it will get weird with a vector-of-i1s smaller than a byte...

In D85364#2197915, @efriedma wrote:

At VL=512, the v8i1 mask will be promoted to v8i64. In order to lower this to a scalable mask, we'd need to insert the v8i64 subvector into a nxv2i64. And then truncate that ZPR by performing a CMPNE against 0, to get the final nxv2i1 mask. Between the zero extend to promote the vXi1 mask, and the truncate to get back to a nxvXi1, there's a lot of extra instructions.

I had this concern when I was reviewing the code in question. @paulwalker-arm said he found the conversions were usually folded away in his prototype. Most i1 vectors will be produced by a compare that returns an nxv2i1 or something like that.

I guess if it is amortized away, it's not a big deal. But a CMPNE is 4 cycles and a SX is 4 cycles. So we have an 8 cycle no-op. That's not great.

Also, I think we can get rid of this AND, unless I'm missing an edge case. The LSBs will still be 1 without the AND. I'm not sure if the top bits could get us in trouble though:

        // Truncate the mask: nxv2i64 -> nxv2i1
	and	z0.d, z0.d, #0x1
	cmpne	p2.d, p1/z, z0.d, #0

In D85364#2198164, @cameron.mcinally wrote:

In D85364#2197931, @efriedma wrote:

For load <8 x i1> specifically, the code is terrible because we're using the generic target-independent expansion, which goes element by element. If we cared, we could custom-lower it to something more reasonable. Nobody has looked into it because there isn't any way to generate that operation from C code.

That's interesting. If we could load the vXi1 and then vector extend it, it might be more palatable. I haven't checked if there are instructions to support that though. And I wonder if it will get weird with a vector-of-i1s smaller than a byte...

We could do something like svlsr(svdup(x), svindex(0,1)). But again, it's not really worth optimizing this.
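A scalar model of the `svlsr(svdup(x), svindex(0,1))` idea may make the intent clearer. It is a sketch only: the function name and the fixed 8-lane width are assumptions, and it uses ordinary C++ loops rather than the ACLE intrinsics themselves.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of svlsr(svdup(x), svindex(0, 1)) for 8 lanes.
// The function name and lane count are assumptions for illustration.
std::array<uint64_t, 8> shiftBroadcastByIndex(uint8_t X) {
  std::array<uint64_t, 8> Lanes{};
  for (unsigned I = 0; I < Lanes.size(); ++I) {
    uint64_t Dup = X;        // svdup: every lane holds x
    uint64_t Index = I;      // svindex(0, 1): lane i holds the value i
    Lanes[I] = Dup >> Index; // svlsr: bit i of x becomes the LSB of lane i
  }
  return Lanes;
}
```

Lane i still carries the higher bits of x above its LSB, so a consumer would look only at the low bit of each lane when forming the predicate.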

In D85364#2197915, @efriedma wrote:

At VL=512, the v8i1 mask will be promoted to v8i64. In order to lower this to a scalable mask, we'd need to insert the v8i64 subvector into a nxv2i64. And then truncate that ZPR by performing a CMPNE against 0, to get the final nxv2i1 mask. Between the zero extend to promote the vXi1 mask, and the truncate to get back to a nxvXi1, there's a lot of extra instructions.

I had this concern when I was reviewing the code in question. @paulwalker-arm said he found the conversions were usually folded away in his prototype. Most i1 vectors will be produced by a compare that returns an nxv2i1 or something like that.

I guess if it is amortized away, it's not a big deal. But a CMPNE is 4 cycles and a SX is 4 cycles. So we have an 8 cycle no-op. That's not great.

Folded away, as in, DAGCombine gets rid of the extra instructions. If that doesn't work right now, we should be able to make it work with a little more code.

Also, I think we can get rid of this AND, unless I'm missing an edge case.

VSELECT masks are guaranteed to be all-ones or all-zeros for vector types on ARM.
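The consequence for the AND can be checked per lane. The sketch below models one 64-bit lane with and without the `and z0.d, z0.d, #0x1` step; the helper names are hypothetical and it assumes nothing beyond the all-ones/all-zeros guarantee stated above.

```cpp
#include <cassert>
#include <cstdint>

// Model of one lane of "and z0.d, z0.d, #0x1" followed by "cmpne ..., #0".
bool laneActiveWithAnd(int64_t Lane) { return (Lane & 1) != 0; }

// Model of the same lane with the AND dropped.
bool laneActiveWithoutAnd(int64_t Lane) { return Lane != 0; }

// True when both forms agree for the given lane value.
bool andIsRedundantFor(int64_t Lane) {
  return laneActiveWithAnd(Lane) == laneActiveWithoutAnd(Lane);
}
```

Both forms agree for 0 and -1, the only values the guarantee allows. They disagree for a value such as 2, which is the "top bits" case the AND would otherwise protect against.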

Matt added a subscriber: Matt. Aug 5 2020, 11:55 PM

In D85364#2198257, @efriedma wrote:

In D85364#2198164, @cameron.mcinally wrote:

In D85364#2197915, @efriedma wrote:

At VL=512, the v8i1 mask will be promoted to v8i64. In order to lower this to a scalable mask, we'd need to insert the v8i64 subvector into a nxv2i64. And then truncate that ZPR by performing a CMPNE against 0, to get the final nxv2i1 mask. Between the zero extend to promote the vXi1 mask, and the truncate to get back to a nxvXi1, there's a lot of extra instructions.

I had this concern when I was reviewing the code in question. @paulwalker-arm said he found the conversions were usually folded away in his prototype. Most i1 vectors will be produced by a compare that returns an nxv2i1 or something like that.

I guess if it is amortized away, it's not a big deal. But a CMPNE is 4 cycles and a SX is 4 cycles. So we have an 8 cycle no-op. That's not great.

Folded away, as in, DAGCombine gets rid of the extra instructions. If that doesn't work right now, we should be able to make it work with a little more code.

I see now. My misunderstanding was that the folding away only happened within the loop block. And that we'd still have to pay the cost pre and post loop.

Reiterating what (I think) Paul said from the SVE Sync-up call, it sounds like he has a plan for DAGCombine to clean these up during ISel, instead of during type legalization. If that's correct, then it should be all good.

Just to better put my position in words. This patch does highlight the expected problem with the fixed length code generation for SVE approach, namely extensions and truncations are more costly than they need to be for SVE. At the block level my expectation is that most should get folded away as more and more of the fixed length operations are lowered to SVE. Here though is where my expectation does not match reality because I believe we lower the ext/trunc operations too early meaning we don't have a DAGCombine phase that's able to remove them. This is why my original prototype made these operations legal and effectively lowered them during selection. That said it's presumably possible to recognise the target equivalent patterns?

This will not help when crossing block boundaries (perhaps it will for global isel?), so ultimately, for the best code quality[1], we're going to want to consider i1 based vectors legal, but that is not a straightforward transformation (perhaps a topic for the SVE sync call?), and while "expanding" operations is a much bigger performance issue I'd rather remain focused on that. Of course there's the argument that the more we move along the current path the harder it will be to change, but currently there's only a handful of functions that relate to this and I'm continually monitoring it.

[1] Well, other than using scalable vectors directly :)

Updating to exhibit the problem mentioned in D85546. One way to see the issue is:

llc -aarch64-sve-vector-bits-min=256 -asm-verbose=0 < llvm-project/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-select.ll

The problem originates in Legalize. DAGTypeLegalizer::PromoteIntOp_SIGN_EXTEND(...) will produce fixed length DAGs like this:

    t18: v8i16 = any_extend t16
  t20: v8i16 = sign_extend_inreg t18, ValueType:ch:v8i1
t12: v8f16 = vselect t20, t10, t11

This ISD::SIGN_EXTEND_INREG operation expects both the result and operand type to be the same.

However, when we go to lower this to AArch64ISD::SIGN_EXTEND_INREG_MERGE_PASSTHRU, that operation expects a half vector type for the operand. Here's a special case -- for illustrative purposes only:

def : Pat<(sext_inreg (nxv8i16 ZPR:$Zs), nxv8i8),  (SXTB_ZPmZ_H (IMPLICIT_DEF), (PTRUE_H 31), ZPR:$Zs)>;

On the surface, it's not really a big problem, since nxv8i8 and nxv8i16 are really the same register class. I just can't find a good way to bitcast between the two types with the fixed length lowering utilities.

cameron.mcinally mentioned this in D85546: [SVE] Add ISD nodes for predicated integer extend inreg operations. Aug 24 2020, 1:41 PM

In D85364#2234543, @cameron.mcinally wrote:

However, when we go to lower this to AArch64ISD::SIGN_EXTEND_INREG_MERGE_PASSTHRU, that operation expects a half vector type for the operand. Here's a special case -- for illustrative purposes only:

To be precise SIGN_EXTEND_INREG_MERGE_PASSTHRU expects its VT operand to have the same number of elements as its data operand. The element type of the VT operand describes the location of the sign bit that is used when performing the extension. This is identical to the requirements for SIGN_EXTEND_INREG.
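As a rough illustration of those semantics, here is a scalar model of sign_extend_inreg on a single 16-bit lane. The helper name is hypothetical, and `FromBits` stands in for the element width of the VT operand (1 for v8i1, 8 for nxv8i8); the result keeps the same width as the input lane.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of sign_extend_inreg on one 16-bit lane. FromBits plays the
// role of the VT operand's element width. The helper name is hypothetical.
int16_t signExtendInReg16(uint16_t Lane, unsigned FromBits) {
  assert(FromBits >= 1 && FromBits <= 16 && "sign bit must lie inside the lane");
  if (FromBits == 16)
    return static_cast<int16_t>(Lane);
  uint16_t SignBit = static_cast<uint16_t>(1u << (FromBits - 1));
  uint16_t Low = Lane & static_cast<uint16_t>((1u << FromBits) - 1);
  // (Low ^ SignBit) - SignBit replicates the sign bit into the upper bits.
  return static_cast<int16_t>((Low ^ SignBit) - SignBit);
}
```

With `FromBits == 1` every lane collapses to 0 or -1, which is the mask form that the vselect in the DAG above consumes.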

I'm still not sure of the issue you are highlighting. When I apply the patch and run the test, the first thing I hit is a selection failure, which is fixed by D86394. I then hit an assert in AArch64TargetLowering::ReconstructShuffle, which looks like it needs updating to take wide and/or scalable vectors into account. I just stubbed the call out and the code generation succeeds, albeit with terrible code. Most of this looks to be due to the lowering of the i1 based load. When I replace that load with an integer load (matching the size of the floating-point operands) and use an icmp[1] to construct the select mask I get the expected code generation.

[1] We don't have lowering for floating-point compares yet. I have a work in progress patch but it's proving a little problematic as I'm trying to make use of existing expand code like we do for scalable vectors but it seems the two legalisation paths are not equivalent.

Bah, egg on my face. You're right that D86394 fixes the immediate issue. Sorry for the noise.

cameron.mcinally mentioned this in D86894: [SVE] Disable INSERT_SUBVECTOR DAGCombine for scalable vectors. Sep 1 2020, 1:04 PM

paulwalker-arm mentioned this in D71767: [POC][SVE] Allow code generation for fixed length vectorised loops [Patch 2/2]. Sep 2 2020, 4:22 AM

Updating Diff for recent upstream changes.

There are two maybe controversial changes, so will comment on them inline...

cameron.mcinally added a subscriber: t.p.northover. Sep 2 2020, 8:45 AM

cameron.mcinally added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7337–7341	@t.p.northover, does this change look okay to you? I suspect that this assert is assuming we only have NEON 64b and 128b vectors. Fixed width SVE now has larger vectors, and for these particular tests we want to extract a 1/4 width subvector from a single width vector.
9136	Just checking if anyone sees a problem with this change. We need to explicitly check that this is a 128b vector now, since SVE can see larger fixed width vectors.

cameron.mcinally added a reviewer: t.p.northover. Sep 2 2020, 8:45 AM

Fix bad copy-and-paste in CHECK lines. No significant changes made.

Update tests -- don't use pointer args for NEON sized vectors.

efriedma added inline comments. Sep 2 2020, 2:57 PM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7328	Is this assert also a potential issue?
7337–7341	Why is 4 special here?
7377	Is AArch64ISD::EXT going to work with arbitrarily large vectors?
9136	This looks fine.

cameron.mcinally added inline comments. Sep 3 2020, 6:35 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7337–7341	It's a splitting quirk. I'm not even sure how to succinctly describe it, but here goes...

Let's assume a fixed width edge-case of -aarch64-sve-vector-bits-min=256. The weird splitting kicks in with vectors of >=64 elements. In particular, anytime we hit a v64i1 and the operands have elements of 32b, the problem comes up.

A v64i1 isn't a legal type, so it's split to 2*v32i1. v32i1 isn't legal either, but we can promote it to v32i8 (a 256b type). This is where the first BUILD_VECTOR comes in. Call it v32i8=BUILD_VECTOR. So now we have something like:

<v32i32> = vselect <v32i8> cond, <v32i32> op1, <v32i32> op2

Legalization continues, and v32i32 isn't legal, so we split again to v16i32, and again to v8i32 (now legal at 256b). The mask is also split during this, so we end up with a v8i8=BUILD_VECTOR for the mask operand.

<v8i32> = vselect <v8i8> cond, <v8i32> op1, <v8i32> op2

Everything is legal now. The ReconstructShuffle(...) function now lowers the v8i8=BUILD_VECTOR, and finds the v32i8=BUILD_VECTOR as the source. So that v32/v8 gives us the magic 4. It's wonky, but I'm fairly sure it's bound at 4.

Now that I'm writing this out, I'm seeing that ReconstructShuffle(...) probably needs to be guarded better in general. We shouldn't hit the problem with fixed lowering (yet?), but I suppose someone could write a pathological set of BUILD_VECTORs to trip it up.

It's unfortunate that we can't distinguish masks from integer vectors. If we could, we'd avoid this. But that would mean making the vXi1 types legal, which I'm assuming causes headaches with NEON masks.
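The element counts in this comment can be traced with a small scalar model. It is a sketch under stated assumptions: both helpers are hypothetical and only mimic "halve until it fits" splitting, and only the 256-bit, 32-bit-element case comes from the comment itself; other inputs are outside what the thread describes.

```cpp
#include <cassert>

// Halve the element count until the vector fits in one register.
// Hypothetical helper modelling type-legalizer splitting, not an LLVM API.
unsigned splitUntilLegal(unsigned NumElts, unsigned EltBits, unsigned RegBits) {
  assert(EltBits != 0 && EltBits <= RegBits && "element must fit in a register");
  while (NumElts * EltBits > RegBits)
    NumElts /= 2;
  return NumElts;
}

// Ratio between the promoted mask BUILD_VECTOR and the mask slice that the
// final, legal VSELECT consumes.
unsigned maskSourceRatio(unsigned MaskElts, unsigned DataEltBits,
                         unsigned RegBits) {
  // vNi1 is split until its i8-promoted form fits (v64i1 -> v32i1 -> v32i8).
  unsigned PromotedMaskElts = splitUntilLegal(MaskElts, 8, RegBits);
  // The data operands keep splitting (v32i32 -> v16i32 -> v8i32).
  unsigned LegalDataElts =
      splitUntilLegal(PromotedMaskElts, DataEltBits, RegBits);
  return PromotedMaskElts / LegalDataElts;
}
```

For the case in the comment, `maskSourceRatio(64, 32, 256)` works out to 32 / 8, the factor of 4 between the v32i8 source and the v8i8 BUILD_VECTOR.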

cameron.mcinally added inline comments. Sep 3 2020, 8:40 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7328	Not sure. The fixed-width lowering case should never get here. Splitting will only be taking a subvector out of a larger vector with the same element type.
7377	The fixed-width lowering case should never get here. We always have one source that's being split. But, yeah, I agree that this is a little sketchy. Maybe we should have a separate routine to lower the splitting case? I don't know. Just thinking out loud. Or maybe we should have a special 1 source case within ReconstructShuffle(...)? That could work.

If you aren't planning to extend ReconstructShuffle to work with arbitrary BUILD_VECTOR nodes, I'd recommend adding a bailout, and implementing a separate codepath for large vectors that only handles the cases you actually need.

Turns out that ReconstructShuffle(...) needs a lot of work to be updated. The EXTRACT_SUBVECTOR index assumes we're extracting a 1/2 width vector. I missed that.

I added a function that lowers specific BUILD_VECTORs to EXTRACT_SUBVECTORs, hoping that it would short-circuit the ReconstructShuffle(...) problem. That was a dead end, since small EXTRACT_SUBVECTORs aren't legal. And we'd still run into the ReconstructShuffle(...) problem with those small BUILD_VECTORs.

But when the EXTRACT_SUBVECTORs are large enough to be legal, the new function could be a win with some tuning. It replaces the scalar loads with a vector load. I've included that function in this Diff for review, but will break it out into a separate Diff if it seems interesting enough to pursue. Thoughts?

Fix comment.

Remove the BUILD_VECTOR->EXTRACT_SUBVECTOR transform so that this is easier to review. Will post a separate Diff for that.

When we legalize vectors greater than the fixed VL, the splitting code is pretty poor. That will need more work in the future.

cameron.mcinally marked 3 inline comments as done. Sep 14 2020, 7:45 AM

cameron.mcinally added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7337–7341	This now bails out when a non-NEON fixed width BUILD_VECTOR is seen. It's not ideal, but could be updated later if desired.

efriedma added inline comments. Sep 14 2020, 11:37 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7337–7341	I think you need to guard the creation of the AArch64ISD::EXT more explicitly; it won't do anything right now, but it might once we start messing with isShuffleMaskLegal.

cameron.mcinally added inline comments. Sep 14 2020, 2:13 PM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7337–7341	Did you have something particular in mind? We could check `Src.MaxElt - Src.MinElt <= NumSrcElts/2` (note the `<` is needed in case undefs are in play). That's true for all 3 if-conditions though, so would be better as an assert. We could also check that `Src.MinElt < NumSrcElts && Src.MaxElt >= NumSrcElts`. That should be covered by the 2 preceding if-conditions, but it would be good to fortify for future code changes. Am I missing anything else?
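The case split referred to in this comment can be sketched as a standalone classifier. This is an illustration, not the code in ReconstructShuffle(...): the enum and function names are hypothetical, and the conditions are reconstructed from the ones quoted in the comment and visible in the diff.

```cpp
#include <cassert>

// Which extraction would be picked for a source that is twice as wide as the
// result. Enum and function names are hypothetical.
enum class Extraction { Fail, HighHalf, LowHalf, Ext };

// MinElt/MaxElt: lowest and highest source lanes referenced.
// NumSrcElts: number of lanes in the (half-width) result.
Extraction classifyExtraction(unsigned MinElt, unsigned MaxElt,
                              unsigned NumSrcElts) {
  assert(MinElt <= MaxElt && "lane range must be ordered");
  if (MaxElt - MinElt >= NumSrcElts)
    return Extraction::Fail;     // span too large for an EXT to cover
  if (MinElt >= NumSrcElts)
    return Extraction::HighHalf; // every lane lives in the upper half
  if (MaxElt < NumSrcElts)
    return Extraction::LowHalf;  // every lane lives in the lower half
  return Extraction::Ext;        // the window straddles both halves
}
```

Reaching `Extraction::Ext` implies `MinElt < NumSrcElts && MaxElt >= NumSrcElts`, which is the property the comment suggests asserting to guard against future changes.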

efriedma added inline comments. Sep 16 2020, 7:12 PM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7337–7341	Maybe something like the following?

if (!SrcVT.is64BitVector()) {
  // Don't know how to lower AArch64ISD::EXT for SVE vectors.
  return SDValue();
}

Here's the most obvious choice -- glueing the new check to the AArch64ISD::EXT generation. That's good because the intention is obvious, and future code changes won't get in the way. But it could also be argued that the NEON-sized vector check should precede this entire if-else statement. I.e. the EXTRACT_SUBVECTORs have the same problem as AArch64ISD::EXT. None of this code is ready for SVE-sized fixed vectors.

Thoughts on this? I don't have a strong opinion.

LGTM

The other codepaths should do the right thing in theory, so I'm fine with leaving them, I think.

This revision is now accepted and ready to land. Sep 17 2020, 11:37 AM

Closed by commit rGa35c7f30769b: [SVE][WIP] Implement lowering for fixed length VSELECT to Scalable (authored by cameron.mcinally). Sep 17 2020, 12:03 PM

This revision was automatically updated to reflect the committed changes.

cameron.mcinally added a commit: rGa35c7f30769b: [SVE][WIP] Implement lowering for fixed length VSELECT to Scalable.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.h	1 line
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	44 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-fp-select.ll	317 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-int-select.ll	415 lines

Diff 292579

llvm/lib/Target/AArch64/AArch64ISelLowering.h

... 914 unchanged lines hidden; context: private:
   SDValue LowerSVEStructLoad(unsigned Intrinsic, ArrayRef<SDValue> LoadOps,
                              EVT VT, SelectionDAG &DAG, const SDLoc &DL) const;

   SDValue LowerFixedLengthVectorIntDivideToSVE(SDValue Op,
                                                SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorIntExtendToSVE(SDValue Op,
                                                SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorLoadToSVE(SDValue Op, SelectionDAG &DAG) const;
+  SDValue LowerFixedLengthVectorSelectToSVE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorSetccToSVE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorStoreToSVE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorTruncateToSVE(SDValue Op,
                                               SelectionDAG &DAG) const;

   SDValue BuildSDIVPow2(SDNode *N, const APInt &Divisor, SelectionDAG &DAG,
                         SmallVectorImpl<SDNode *> &Created) const override;
   SDValue getSqrtEstimate(SDValue Operand, SelectionDAG &DAG, int Enabled,
... 76 unchanged lines hidden

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

... 1,168 unchanged lines hidden; context: void AArch64TargetLowering::addTypeForFixedLengthSVE(MVT VT) {
   setOperationAction(ISD::SRA, VT, Custom);
   setOperationAction(ISD::SRL, VT, Custom);
   setOperationAction(ISD::STORE, VT, Custom);
   setOperationAction(ISD::SUB, VT, Custom);
   setOperationAction(ISD::TRUNCATE, VT, Custom);
   setOperationAction(ISD::UDIV, VT, Custom);
   setOperationAction(ISD::UMAX, VT, Custom);
   setOperationAction(ISD::UMIN, VT, Custom);
+  setOperationAction(ISD::VSELECT, VT, Custom);
   setOperationAction(ISD::XOR, VT, Custom);
   setOperationAction(ISD::ZERO_EXTEND, VT, Custom);
 }

 void AArch64TargetLowering::addDRTypeForNEON(MVT VT) {
   addRegisterClass(VT, &AArch64::FPR64RegClass);
   addTypeForNEON(VT, MVT::v2i32);
 }
... 2,665 unchanged lines hidden; context: SDValue AArch64TargetLowering::LowerOperation(SDValue Op,
   case ISD::AND:
     return LowerToScalableOp(Op, DAG);
   case ISD::SUB:
     return LowerToPredicatedOp(Op, DAG, AArch64ISD::SUB_PRED);
   case ISD::FMAXNUM:
     return LowerToPredicatedOp(Op, DAG, AArch64ISD::FMAXNM_PRED);
   case ISD::FMINNUM:
     return LowerToPredicatedOp(Op, DAG, AArch64ISD::FMINNM_PRED);
+  case ISD::VSELECT:
+    return LowerFixedLengthVectorSelectToSVE(Op, DAG);
   }
 }

 bool AArch64TargetLowering::useSVEForFixedLengthVectors() const {
   // Prefer NEON unless larger SVE registers are available.
   return Subtarget->hasSVE() && Subtarget->getMinSVEVectorSizeInBits() >= 256;
 }

▲ Show 20 Lines • Show All 3,451 Lines • ▼ Show 20 Lines	for (auto &Src : Sources) {

// This stage of the search produces a source with the same element type as		// This stage of the search produces a source with the same element type as
// the original, but with a total width matching the BUILD_VECTOR output.		// the original, but with a total width matching the BUILD_VECTOR output.
EVT EltVT = SrcVT.getVectorElementType();		EVT EltVT = SrcVT.getVectorElementType();
unsigned NumSrcElts = VT.getSizeInBits() / EltVT.getSizeInBits();		unsigned NumSrcElts = VT.getSizeInBits() / EltVT.getSizeInBits();
EVT DestVT = EVT::getVectorVT(*DAG.getContext(), EltVT, NumSrcElts);		EVT DestVT = EVT::getVectorVT(*DAG.getContext(), EltVT, NumSrcElts);

if (SrcVT.getSizeInBits() < VT.getSizeInBits()) {		if (SrcVT.getSizeInBits() < VT.getSizeInBits()) {
assert(2 * SrcVT.getSizeInBits() == VT.getSizeInBits());		assert(2 * SrcVT.getSizeInBits() == VT.getSizeInBits());
		efriedmaUnsubmitted Done Reply Inline Actions Is this assert also a potential issue? efriedma: Is this assert also a potential issue?
		cameron.mcinallyAuthorUnsubmitted Done Reply Inline Actions Not sure. The fixed-width lowering case should never get here. Splitting will only be taking a subvector out of a larger vector with the same element type. cameron.mcinally: Not sure. The fixed-width lowering case should never get here. Splitting will only be taking a…
// We can pad out the smaller vector for free, so if it's part of a		// We can pad out the smaller vector for free, so if it's part of a
// shuffle...		// shuffle...
Src.ShuffleVec =		Src.ShuffleVec =
DAG.getNode(ISD::CONCAT_VECTORS, dl, DestVT, Src.ShuffleVec,		DAG.getNode(ISD::CONCAT_VECTORS, dl, DestVT, Src.ShuffleVec,
DAG.getUNDEF(Src.ShuffleVec.getValueType()));		DAG.getUNDEF(Src.ShuffleVec.getValueType()));
continue;		continue;
}		}

assert(SrcVT.getSizeInBits() == 2 * VT.getSizeInBits());		if (SrcVT.getSizeInBits() != 2 * VT.getSizeInBits()) {
		LLVM_DEBUG(
		dbgs() << "Reshuffle failed: result vector too small to extract\n");
		return SDValue();
		}
		cameron.mcinallyAuthorUnsubmitted Done Reply Inline Actions @t.p.northover, does this change look okay to you? I suspect that this assert is assuming we only have NEON 64b and 128b vectors. Fixed width SVE now has larger vectors, and for these particular tests we want to extract a 1/4 width subvector from a single width vector. cameron.mcinally: @t.p.northover, does this change look okay to you? I suspect that this assert is assuming we…
		efriedmaUnsubmitted Done Reply Inline Actions Why is 4 special here? efriedma: Why is 4 special here?
		cameron.mcinallyAuthorUnsubmitted Done Reply Inline Actions It's a splitting quirk. I'm not even sure how to succinctly describe it, but here goes... Let's assume a fixed width edge-case of -aarch64-sve-vector-bits-min=256. The weird splitting kicks in with vectors of >=64 elements. In particular, anytime we hit a v64i1 and the operands have elements of 32b, the problem comes up. A v64i1 isn't a legal type, so it's split to 2v32i1. v32i1 isn't legal either, but we can promote it to v32i8 (a 256b type). This is where the first BUILD_VECTOR comes in. Call it v32i8=BUILD_VECTOR. So now we have something like: <v32i32> = vselect <v32i8> cond, <v32i32> op1, <v32i32> op2 Legalization continues, and v32i32 isn't legal, so we split again to v16i32, and again to v8i32 (now legal at 256b). The mask is also split during this, so we end up with a v8i8=BUILD_VECTOR for the mask operand. <v8i32> = vselect <v8i8> cond, <v8i32> op1, <v8i32> op2 Everything is legal now. The ReconstructShuffle(...) function now lowers the v8i8=BUILD_VECTOR, and finds the v32i8=BUILD_VECTOR as the source. So that v32/v8 gives us the magic 4. It's wonky, but I'm fairly sure it's bound at 4. Now that I'm writing this out, I'm seeing that ReconstructShuffle(...) probably needs to be guarded better in general. We shouldn't hit the problem with fixed lowering (yet?), but I suppose someone could write a pathological set of BUILD_VECTORs to trip it up. It's unfortunate that we can't distinguish masks from integer vectors. If we could, we'd avoid this. But that would mean making the vXi1 types legal, which I'm assuming causes headaches with NEON masks. cameron.mcinally:* It's a splitting quirk. I'm not even sure how to succinctly describe it, but here goes...
		cameron.mcinallyAuthorUnsubmitted Done Reply Inline Actions This now bails out when a non-NEON fixed width BUILD_VECTOR is seen. It's not ideal, but could be updated later if desired. cameron.mcinally: This now bails out when a non-NEON fixed width BUILD_VECTOR is seen. It's not ideal, but could…
		efriedmaUnsubmitted Not Done Reply Inline Actions I think you need to guard the creation of the AArch64ISD::EXT more explicitly; it won't do anything right now, but it might once we start messing with isShuffleMaskLegal. efriedma: I think you need to guard the creation of the AArch64ISD::EXT more explicitly; it won't do…
		cameron.mcinallyAuthorUnsubmitted Not Done Reply Inline Actions Did you have something particular in mind? We could check `Src.MaxElt - Src.MinElt <= NumSrcElts/2` (note the `<` is needed in case undefs are in play). That's true for all 3 if-conditions though, so would be better as an assert. We could also check that `Src.MinElt < NumSrcElts && Src.MaxElt >= NumSrcElts`. That should be covered by the 2 preceding if-conditions, but it would be good to fortify for future code changes. Am I missing anything else? cameron.mcinally: Did you have something particular in mind? We could check `Src.MaxElt - Src.MinElt <=…
efriedma [Not Done]: Maybe something like the following?
    if (!SrcVT.is64BitVector()) {
      // Don't know how to lower AArch64ISD::EXT for SVE vectors.
      return SDValue();
    }
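The window checks discussed in this thread can be modeled on their own. The sketch below is a standalone illustration, not the code in ReconstructShuffle(...): `WindowKind`, `classifyWindow`, and `straddlesMidpoint` are hypothetical names, and the middle branch (`MaxElt < NumSrcElts`) is an assumption about the collapsed lines of the diff, inferred from the remark that the second proposed check "should be covered by the 2 preceding if-conditions".

```cpp
#include <cassert>

// Hypothetical classification of the lane window a shuffle source uses.
enum class WindowKind { TooLarge, UpperHalf, LowerHalf, Straddles };

// MinElt/MaxElt: smallest and largest source lane referenced.
// NumSrcElts: number of lanes in one destination-sized half of the source.
WindowKind classifyWindow(int MinElt, int MaxElt, int NumSrcElts) {
  assert(MinElt <= MaxElt && NumSrcElts > 0);
  if (MaxElt - MinElt >= NumSrcElts)
    return WindowKind::TooLarge;  // span too large for one EXT to cover
  if (MinElt >= NumSrcElts)
    return WindowKind::UpperHalf; // window sits entirely in the upper half
  if (MaxElt < NumSrcElts)        // assumed shape of the hidden branch
    return WindowKind::LowerHalf; // window sits entirely in the lower half
  // Remaining case: the window crosses the midpoint. This is the path on
  // which the patch creates AArch64ISD::EXT.
  return WindowKind::Straddles;
}

// The second check proposed in the thread, usable as an assertion on the
// EXT path.
bool straddlesMidpoint(int MinElt, int MaxElt, int NumSrcElts) {
  return MinElt < NumSrcElts && MaxElt >= NumSrcElts;
}
```

Under these assumptions, every window classified as `Straddles` also satisfies `straddlesMidpoint`, which is why the thread suggests the check would work better as an assert than as a new bail-out.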

if (Src.MaxElt - Src.MinElt >= NumSrcElts) {
LLVM_DEBUG(
dbgs() << "Reshuffle failed: span too large for a VEXT to cope\n");
return SDValue();
}

if (Src.MinElt >= NumSrcElts) {
[12 lines not shown]
SDValue VEXTSrc1 =
DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, DestVT, Src.ShuffleVec,
DAG.getConstant(0, dl, MVT::i64));
SDValue VEXTSrc2 =
DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, DestVT, Src.ShuffleVec,
DAG.getConstant(NumSrcElts, dl, MVT::i64));
unsigned Imm = Src.MinElt * getExtFactor(VEXTSrc1);

		if (!SrcVT.is64BitVector()) {
		LLVM_DEBUG(
		dbgs() << "Reshuffle failed: don't know how to lower AArch64ISD::EXT "
		"for SVE vectors.");
		return SDValue();
		}

Src.ShuffleVec = DAG.getNode(AArch64ISD::EXT, dl, DestVT, VEXTSrc1,
efriedma [Done]: Is AArch64ISD::EXT going to work with arbitrarily large vectors?
cameron.mcinally (Author) [Done]: The fixed-width lowering case should never get here. We always have one source that's being split. But, yeah, I agree that this is a little sketchy. Maybe we should have a separate routine to lower the splitting case? I don't know. Just thinking out loud. Or maybe we should have a special 1 source case within ReconstructShuffle(...)? That could work.
VEXTSrc2,
DAG.getConstant(Imm, dl, MVT::i32));
Src.WindowBase = -Src.MinElt;
}
}

// Another possible incompatibility occurs from the vector element types. We
// can fix this by bitcasting the source vectors to the same type we intend
[1,740 lines not shown; context: SDValue AArch64TargetLowering::LowerEXTRACT_SUBVECTOR(SDValue Op,]
}

// This will get lowered to an appropriate EXTRACT_SUBREG in ISel.
if (Idx == 0 && InVT.getSizeInBits() <= 128)
return Op;

// If this is extracting the upper 64-bits of a 128-bit vector, we match
// that directly.
if (Size == 64 && Idx * InVT.getScalarSizeInBits() == 64)		if (Size == 64 && Idx * InVT.getScalarSizeInBits() == 64 &&
		InVT.getSizeInBits() == 128)
return Op;
cameron.mcinally (Author) [Not Done]: Just checking if anyone sees a problem with this change. We need to explicitly check that this is a 128b vector now, since SVE can see larger fixed width vectors.
efriedma [Not Done]: This looks fine.
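A small standalone comparison may help show why the extra `InVT.getSizeInBits() == 128` condition matters once wider fixed length vectors are possible. The helper names below are hypothetical, and the v8i32 case is an illustrative assumption rather than something taken from the patch.

```cpp
#include <cassert>

// Hypothetical helpers mirroring the old and new forms of the condition.
// Size: bits in the extracted subvector; Idx: starting lane;
// ScalarBits: bits per lane; InBits: bits in the input vector.
bool matchesUpperHalfOld(unsigned Size, unsigned Idx, unsigned ScalarBits) {
  return Size == 64 && Idx * ScalarBits == 64;
}

bool matchesUpperHalfNew(unsigned Size, unsigned Idx, unsigned ScalarBits,
                         unsigned InBits) {
  return Size == 64 && Idx * ScalarBits == 64 && InBits == 128;
}
```

Extracting v2i32 at lane 2 from a 128-bit v4i32 satisfies both forms. The same extract from a 256-bit v8i32 still satisfies the old form, even though lanes 2 and 3 are no longer the upper half of the input; only the new form rejects it.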

return SDValue();
}

SDValue AArch64TargetLowering::LowerINSERT_SUBVECTOR(SDValue Op,
SelectionDAG &DAG) const {
assert(Op.getValueType().isScalableVector() &&
"Only expect to lower inserts into scalable vectors!");
[6,741 lines not shown; context: assert(useSVEForFixedLengthVectorVT(V.getValueType()) &&]
"Only fixed length vectors are supported!");
Ops.push_back(convertToScalableVector(DAG, ContainerVT, V));
}

auto ScalableRes = DAG.getNode(Op.getOpcode(), SDLoc(Op), ContainerVT, Ops);
return convertFromScalableVector(DAG, VT, ScalableRes);
}

		SDValue
		AArch64TargetLowering::LowerFixedLengthVectorSelectToSVE(SDValue Op,
		SelectionDAG &DAG) const {
		EVT VT = Op.getValueType();
		SDLoc DL(Op);

		EVT InVT = Op.getOperand(1).getValueType();
		EVT ContainerVT = getContainerForFixedLengthVector(DAG, InVT);
		SDValue Op1 = convertToScalableVector(DAG, ContainerVT, Op->getOperand(1));
		SDValue Op2 = convertToScalableVector(DAG, ContainerVT, Op->getOperand(2));

		// Convert the mask to a predicate (NOTE: We don't need to worry about
		// inactive lanes since VSELECT is safe when given undefined elements).
		EVT MaskVT = Op.getOperand(0).getValueType();
		EVT MaskContainerVT = getContainerForFixedLengthVector(DAG, MaskVT);
		auto Mask = convertToScalableVector(DAG, MaskContainerVT, Op.getOperand(0));
		Mask = DAG.getNode(ISD::TRUNCATE, DL,
		MaskContainerVT.changeVectorElementType(MVT::i1), Mask);

		auto ScalableRes = DAG.getNode(ISD::VSELECT, DL, ContainerVT,
		Mask, Op1, Op2);

		return convertFromScalableVector(DAG, VT, ScalableRes);
		}
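As a rough mental model of what LowerFixedLengthVectorSelectToSVE asks the backend to compute, the lane-by-lane sketch below truncates each promoted mask lane to a single bit and then selects. It is plain scalar C++, not SelectionDAG code; `selectWithPromotedMask` is a hypothetical name, and the use of i8 mask lanes and i32 data lanes is an assumption made only to keep the example small.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of the lowering: the mask arrives as promoted integer lanes,
// is truncated to i1 by keeping bit 0, and then drives a per-lane select.
std::vector<int32_t> selectWithPromotedMask(const std::vector<uint8_t> &Mask,
                                            const std::vector<int32_t> &Op1,
                                            const std::vector<int32_t> &Op2) {
  assert(Mask.size() == Op1.size() && Op1.size() == Op2.size());
  std::vector<int32_t> Res(Op1.size());
  for (std::size_t I = 0; I < Res.size(); ++I) {
    // TRUNCATE to i1: corresponds to the "and #0x1" + "cmpne #0" pair that
    // the tests in this revision check for.
    bool Cond = (Mask[I] & 1) != 0;
    // VSELECT: corresponds to the "sel" instruction in the tests.
    Res[I] = Cond ? Op1[I] : Op2[I];
  }
  return Res;
}
```

Note that only bit 0 of each mask lane matters in this model, so a lane holding 2 selects from the second operand while a lane holding 3 selects from the first.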

SDValue AArch64TargetLowering::LowerFixedLengthVectorSetccToSVE(
SDValue Op, SelectionDAG &DAG) const {
SDLoc DL(Op);
EVT InVT = Op.getOperand(0).getValueType();
EVT ContainerVT = getContainerForFixedLengthVector(DAG, InVT);

assert(useSVEForFixedLengthVectorVT(InVT) &&
"Only expected to lower fixed length vector operation!");
[19 lines not shown]

llvm/test/CodeGen/AArch64/sve-fixed-length-fp-select.ll

This file was added.

				; RUN: llc -aarch64-sve-vector-bits-min=128 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=16 -check-prefix=NO_SVE
				; RUN: llc -aarch64-sve-vector-bits-min=256 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK,VBITS_EQ_256
				; RUN: llc -aarch64-sve-vector-bits-min=384 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK
				; RUN: llc -aarch64-sve-vector-bits-min=512 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=640 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=768 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=896 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=1024 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1152 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1280 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1408 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1536 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1664 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1792 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1920 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=2048 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=256 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024,VBITS_GE_2048

				target triple = "aarch64-unknown-linux-gnu"

				; Don't use SVE when its registers are no bigger than NEON.
				; NO_SVE-NOT: ptrue

				; Don't use SVE for 64-bit vectors.
				define <4 x half> @select_v4f16(<4 x half> %op1, <4 x half> %op2, <4 x i1> %mask) #0 {
				; CHECK-LABEL: select_v4f16:
				; CHECK: bif v0.8b, v1.8b, v2.8b
				; CHECK: ret
				%sel = select <4 x i1> %mask, <4 x half> %op1, <4 x half> %op2
				ret <4 x half> %sel
				}

				; Don't use SVE for 128-bit vectors.
				define <8 x half> @select_v8f16(<8 x half> %op1, <8 x half> %op2, <8 x i1> %mask) #0 {
				; CHECK-LABEL: select_v8f16:
				; CHECK: bif v0.16b, v1.16b, v2.16b
				; CHECK: ret
				%sel = select <8 x i1> %mask, <8 x half> %op1, <8 x half> %op2
				ret <8 x half> %sel
				}

				define void @select_v16f16(<16 x half>* %a, <16 x half>* %b, <16 x i1>* %c) #0 {
				; CHECK-LABEL: select_v16f16:
				; CHECK: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),16)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].h
				; VBITS_GE_256: ld1h { [[MASK:z[0-9]+]].h }, [[PG]]/z, [x9]
				; VBITS_GE_256-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_256-NEXT: and [[AND:z[0-9]+]].h, [[MASK]].h, #0x1
				; VBITS_GE_256-NEXT: cmpne [[COND:p[0-9]+]].h, [[PG1]]/z, [[AND]].h, #0
				; VBITS_GE_256-NEXT: sel [[RES:z[0-9]+]].h, [[COND]], [[OP1]].h, [[OP2]].h
				; VBITS_GE_256-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_256: ret
				%mask = load <16 x i1>, <16 x i1>* %c
				%op1 = load <16 x half>, <16 x half>* %a
				%op2 = load <16 x half>, <16 x half>* %b
				%sel = select <16 x i1> %mask, <16 x half> %op1, <16 x half> %op2
				store <16 x half> %sel, <16 x half>* %a
				ret void
				}

				define void @select_v32f16(<32 x half>* %a, <32 x half>* %b, <32 x i1>* %c) #0 {
				; CHECK-LABEL: select_v32f16:
				; CHECK: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),32)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].h
				; VBITS_GE_512: ld1h { [[MASK:z[0-9]+]].h }, [[PG]]/z, [x9]
				; VBITS_GE_512-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_512-NEXT: and [[AND:z[0-9]+]].h, [[MASK]].h, #0x1
				; VBITS_GE_512-NEXT: cmpne [[COND:p[0-9]+]].h, [[PG1]]/z, [[AND]].h, #0
				; VBITS_GE_512-NEXT: sel [[RES:z[0-9]+]].h, [[COND]], [[OP1]].h, [[OP2]].h
				; VBITS_GE_512-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_512: ret
				%mask = load <32 x i1>, <32 x i1>* %c
				%op1 = load <32 x half>, <32 x half>* %a
				%op2 = load <32 x half>, <32 x half>* %b
				%sel = select <32 x i1> %mask, <32 x half> %op1, <32 x half> %op2
				store <32 x half> %sel, <32 x half>* %a
				ret void
				}

				define void @select_v64f16(<64 x half>* %a, <64 x half>* %b, <64 x i1>* %c) #0 {
				; CHECK-LABEL: select_v64f16:
				; CHECK: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),64)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].h
				; VBITS_GE_1024: ld1h { [[MASK:z[0-9]+]].h }, [[PG]]/z, [x9]
				; VBITS_GE_1024-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_1024-NEXT: and [[AND:z[0-9]+]].h, [[MASK]].h, #0x1
				; VBITS_GE_1024-NEXT: cmpne [[COND:p[0-9]+]].h, [[PG1]]/z, [[AND]].h, #0
				; VBITS_GE_1024-NEXT: sel [[RES:z[0-9]+]].h, [[COND]], [[OP1]].h, [[OP2]].h
				; VBITS_GE_1024-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_1024: ret
				%mask = load <64 x i1>, <64 x i1>* %c
				%op1 = load <64 x half>, <64 x half>* %a
				%op2 = load <64 x half>, <64 x half>* %b
				%sel = select <64 x i1> %mask, <64 x half> %op1, <64 x half> %op2
				store <64 x half> %sel, <64 x half>* %a
				ret void
				}

				define void @select_v128f16(<128 x half>* %a, <128 x half>* %b, <128 x i1>* %c) #0 {
				; CHECK-LABEL: select_v128f16:
				; CHECK: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),128)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].h
				; VBITS_GE_2048: ld1h { [[MASK:z[0-9]+]].h }, [[PG]]/z, [x9]
				; VBITS_GE_2048-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_2048-NEXT: and [[AND:z[0-9]+]].h, [[MASK]].h, #0x1
				; VBITS_GE_2048-NEXT: cmpne [[COND:p[0-9]+]].h, [[PG1]]/z, [[AND]].h, #0
				; VBITS_GE_2048-NEXT: sel [[RES:z[0-9]+]].h, [[COND]], [[OP1]].h, [[OP2]].h
				; VBITS_GE_2048-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_2048: ret
				%mask = load <128 x i1>, <128 x i1>* %c
				%op1 = load <128 x half>, <128 x half>* %a
				%op2 = load <128 x half>, <128 x half>* %b
				%sel = select <128 x i1> %mask, <128 x half> %op1, <128 x half> %op2
				store <128 x half> %sel, <128 x half>* %a
				ret void
				}

				; Don't use SVE for 64-bit vectors.
				define <2 x float> @select_v2f32(<2 x float> %op1, <2 x float> %op2, <2 x i1> %mask) #0 {
				; CHECK-LABEL: select_v2f32:
				; CHECK: bif v0.8b, v1.8b, v2.8b
				; CHECK: ret
				%sel = select <2 x i1> %mask, <2 x float> %op1, <2 x float> %op2
				ret <2 x float> %sel
				}

				; Don't use SVE for 128-bit vectors.
				define <4 x float> @select_v4f32(<4 x float> %op1, <4 x float> %op2, <4 x i1> %mask) #0 {
				; CHECK-LABEL: select_v4f32:
				; CHECK: bif v0.16b, v1.16b, v2.16b
				; CHECK: ret
				%sel = select <4 x i1> %mask, <4 x float> %op1, <4 x float> %op2
				ret <4 x float> %sel
				}

				define void @select_v8f32(<8 x float>* %a, <8 x float>* %b, <8 x i1>* %c) #0 {
				; CHECK-LABEL: select_v8f32:
				; CHECK: ptrue [[PG:p[0-9]+]].s, vl[[#min(div(VBYTES,4),8)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].s
				; VBITS_GE_256: ld1w { [[MASK:z[0-9]+]].s }, [[PG]]/z, [x9]
				; VBITS_GE_256-NEXT: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_GE_256-NEXT: and [[AND:z[0-9]+]].s, [[MASK]].s, #0x1
				; VBITS_GE_256-NEXT: cmpne [[COND:p[0-9]+]].s, [[PG1]]/z, [[AND]].s, #0
				; VBITS_GE_256-NEXT: sel [[RES:z[0-9]+]].s, [[COND]], [[OP1]].s, [[OP2]].s
				; VBITS_GE_256-NEXT: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_GE_256: ret
				%mask = load <8 x i1>, <8 x i1>* %c
				%op1 = load <8 x float>, <8 x float>* %a
				%op2 = load <8 x float>, <8 x float>* %b
				%sel = select <8 x i1> %mask, <8 x float> %op1, <8 x float> %op2
				store <8 x float> %sel, <8 x float>* %a
				ret void
				}

				define void @select_v16f32(<16 x float>* %a, <16 x float>* %b, <16 x i1>* %c) #0 {
				; CHECK-LABEL: select_v16f32:
				; CHECK: ptrue [[PG:p[0-9]+]].s, vl[[#min(div(VBYTES,4),16)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].s
				; VBITS_GE_512: ld1w { [[MASK:z[0-9]+]].s }, [[PG]]/z, [x9]
				; VBITS_GE_512-NEXT: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_GE_512-NEXT: and [[AND:z[0-9]+]].s, [[MASK]].s, #0x1
				; VBITS_GE_512-NEXT: cmpne [[COND:p[0-9]+]].s, [[PG1]]/z, [[AND]].s, #0
				; VBITS_GE_512-NEXT: sel [[RES:z[0-9]+]].s, [[COND]], [[OP1]].s, [[OP2]].s
				; VBITS_GE_512-NEXT: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_GE_512: ret
				%mask = load <16 x i1>, <16 x i1>* %c
				%op1 = load <16 x float>, <16 x float>* %a
				%op2 = load <16 x float>, <16 x float>* %b
				%sel = select <16 x i1> %mask, <16 x float> %op1, <16 x float> %op2
				store <16 x float> %sel, <16 x float>* %a
				ret void
				}

				define void @select_v32f32(<32 x float>* %a, <32 x float>* %b, <32 x i1>* %c) #0 {
				; CHECK-LABEL: select_v32f32:
				; CHECK: ptrue [[PG:p[0-9]+]].s, vl[[#min(div(VBYTES,4),32)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].s
				; VBITS_GE_1024: ld1w { [[MASK:z[0-9]+]].s }, [[PG]]/z, [x9]
				; VBITS_GE_1024-NEXT: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_GE_1024-NEXT: and [[AND:z[0-9]+]].s, [[MASK]].s, #0x1
				; VBITS_GE_1024-NEXT: cmpne [[COND:p[0-9]+]].s, [[PG1]]/z, [[AND]].s, #0
				; VBITS_GE_1024-NEXT: sel [[RES:z[0-9]+]].s, [[COND]], [[OP1]].s, [[OP2]].s
				; VBITS_GE_1024-NEXT: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_GE_1024: ret
				%mask = load <32 x i1>, <32 x i1>* %c
				%op1 = load <32 x float>, <32 x float>* %a
				%op2 = load <32 x float>, <32 x float>* %b
				%sel = select <32 x i1> %mask, <32 x float> %op1, <32 x float> %op2
				store <32 x float> %sel, <32 x float>* %a
				ret void
				}

				define void @select_v64f32(<64 x float>* %a, <64 x float>* %b, <64 x i1>* %c) #0 {
				; CHECK-LABEL: select_v64f32:
				; CHECK: ptrue [[PG:p[0-9]+]].s, vl[[#min(div(VBYTES,4),64)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].s
				; VBITS_GE_2048: ld1w { [[MASK:z[0-9]+]].s }, [[PG]]/z, [x9]
				; VBITS_GE_2048-NEXT: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_GE_2048-NEXT: and [[AND:z[0-9]+]].s, [[MASK]].s, #0x1
				; VBITS_GE_2048-NEXT: cmpne [[COND:p[0-9]+]].s, [[PG1]]/z, [[AND]].s, #0
				; VBITS_GE_2048-NEXT: sel [[RES:z[0-9]+]].s, [[COND]], [[OP1]].s, [[OP2]].s
				; VBITS_GE_2048-NEXT: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_GE_2048: ret
				%mask = load <64 x i1>, <64 x i1>* %c
				%op1 = load <64 x float>, <64 x float>* %a
				%op2 = load <64 x float>, <64 x float>* %b
				%sel = select <64 x i1> %mask, <64 x float> %op1, <64 x float> %op2
				store <64 x float> %sel, <64 x float>* %a
				ret void
				}

				; Don't use SVE for 64-bit vectors.
				define <1 x double> @select_v1f64(<1 x double> %op1, <1 x double> %op2, <1 x i1> %mask) #0 {
				; CHECK-LABEL: select_v1f64:
				; CHECK: bif v0.8b, v1.8b, v2.8b
				; CHECK: ret
				%sel = select <1 x i1> %mask, <1 x double> %op1, <1 x double> %op2
				ret <1 x double> %sel
				}

				; Don't use SVE for 128-bit vectors.
				define <2 x double> @select_v2f64(<2 x double> %op1, <2 x double> %op2, <2 x i1> %mask) #0 {
				; CHECK-LABEL: select_v2f64:
				; CHECK: bif v0.16b, v1.16b, v2.16b
				; CHECK: ret
				%sel = select <2 x i1> %mask, <2 x double> %op1, <2 x double> %op2
				ret <2 x double> %sel
				}

				define void @select_v4f64(<4 x double>* %a, <4 x double>* %b, <4 x i1>* %c) #0 {
				; CHECK-LABEL: select_v4f64:
				; CHECK: ptrue [[PG:p[0-9]+]].d, vl[[#min(div(VBYTES,8),4)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].d
				; VBITS_GE_256: ld1d { [[MASK:z[0-9]+]].d }, [[PG]]/z, [x9]
				; VBITS_GE_256-NEXT: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_GE_256-NEXT: and [[AND:z[0-9]+]].d, [[MASK]].d, #0x1
				; VBITS_GE_256-NEXT: cmpne [[COND:p[0-9]+]].d, [[PG1]]/z, [[AND]].d, #0
				; VBITS_GE_256-NEXT: sel [[RES:z[0-9]+]].d, [[COND]], [[OP1]].d, [[OP2]].d
				; VBITS_GE_256-NEXT: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_GE_256: ret
				%mask = load <4 x i1>, <4 x i1>* %c
				%op1 = load <4 x double>, <4 x double>* %a
				%op2 = load <4 x double>, <4 x double>* %b
				%sel = select <4 x i1> %mask, <4 x double> %op1, <4 x double> %op2
				store <4 x double> %sel, <4 x double>* %a
				ret void
				}

				define void @select_v8f64(<8 x double>* %a, <8 x double>* %b, <8 x i1>* %c) #0 {
				; CHECK-LABEL: select_v8f64:
				; CHECK: ptrue [[PG:p[0-9]+]].d, vl[[#min(div(VBYTES,8),8)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].d
				; VBITS_GE_512: ld1d { [[MASK:z[0-9]+]].d }, [[PG]]/z, [x9]
				; VBITS_GE_512-NEXT: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_GE_512-NEXT: and [[AND:z[0-9]+]].d, [[MASK]].d, #0x1
				; VBITS_GE_512-NEXT: cmpne [[COND:p[0-9]+]].d, [[PG1]]/z, [[AND]].d, #0
				; VBITS_GE_512-NEXT: sel [[RES:z[0-9]+]].d, [[COND]], [[OP1]].d, [[OP2]].d
				; VBITS_GE_512-NEXT: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_GE_512: ret
				%mask = load <8 x i1>, <8 x i1>* %c
				%op1 = load <8 x double>, <8 x double>* %a
				%op2 = load <8 x double>, <8 x double>* %b
				%sel = select <8 x i1> %mask, <8 x double> %op1, <8 x double> %op2
				store <8 x double> %sel, <8 x double>* %a
				ret void
				}

				define void @select_v16f64(<16 x double>* %a, <16 x double>* %b, <16 x i1>* %c) #0 {
				; CHECK-LABEL: select_v16f64:
				; CHECK: ptrue [[PG:p[0-9]+]].d, vl[[#min(div(VBYTES,8),16)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].d
				; VBITS_GE_1024: ld1d { [[MASK:z[0-9]+]].d }, [[PG]]/z, [x9]
				; VBITS_GE_1024-NEXT: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_GE_1024-NEXT: and [[AND:z[0-9]+]].d, [[MASK]].d, #0x1
				; VBITS_GE_1024-NEXT: cmpne [[COND:p[0-9]+]].d, [[PG1]]/z, [[AND]].d, #0
				; VBITS_GE_1024-NEXT: sel [[RES:z[0-9]+]].d, [[COND]], [[OP1]].d, [[OP2]].d
				; VBITS_GE_1024-NEXT: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_GE_1024: ret
				%mask = load <16 x i1>, <16 x i1>* %c
				%op1 = load <16 x double>, <16 x double>* %a
				%op2 = load <16 x double>, <16 x double>* %b
				%sel = select <16 x i1> %mask, <16 x double> %op1, <16 x double> %op2
				store <16 x double> %sel, <16 x double>* %a
				ret void
				}

				define void @select_v32f64(<32 x double>* %a, <32 x double>* %b, <32 x i1>* %c) #0 {
				; CHECK-LABEL: select_v32f64:
				; CHECK: ptrue [[PG:p[0-9]+]].d, vl[[#min(div(VBYTES,8),32)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].d
				; VBITS_GE_2048: ld1d { [[MASK:z[0-9]+]].d }, [[PG]]/z, [x9]
				; VBITS_GE_2048-NEXT: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_GE_2048-NEXT: and [[AND:z[0-9]+]].d, [[MASK]].d, #0x1
				; VBITS_GE_2048-NEXT: cmpne [[COND:p[0-9]+]].d, [[PG1]]/z, [[AND]].d, #0
				; VBITS_GE_2048-NEXT: sel [[RES:z[0-9]+]].d, [[COND]], [[OP1]].d, [[OP2]].d
				; VBITS_GE_2048-NEXT: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_GE_2048: ret
				%mask = load <32 x i1>, <32 x i1>* %c
				%op1 = load <32 x double>, <32 x double>* %a
				%op2 = load <32 x double>, <32 x double>* %b
				%sel = select <32 x i1> %mask, <32 x double> %op1, <32 x double> %op2
				store <32 x double> %sel, <32 x double>* %a
				ret void
				}

				attributes #0 = { "target-features"="+sve" }
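The `vl[[#min(div(VBYTES,N),M)]]` expressions in the checks above can be evaluated by hand. The sketch below does the same arithmetic in C++; `vbytesFor` and `expectedVL` are hypothetical helper names, and the power-of-two rounding in `vbytesFor` is inferred only from the `-D#VBYTES` values that appear in the RUN lines (for example 384 bits maps to 32 bytes), not from any documented rule.

```cpp
#include <algorithm>
#include <cassert>

// VBYTES value paired with a given -aarch64-sve-vector-bits-min in the RUN
// lines: the bit count rounded down to a power of two, expressed in bytes.
unsigned vbytesFor(unsigned VectorBitsMin) {
  assert(VectorBitsMin >= 128);
  unsigned Pow2 = 128;
  while (Pow2 * 2 <= VectorBitsMin)
    Pow2 *= 2;
  return Pow2 / 8;
}

// Value of the FileCheck expression min(div(VBYTES, EltBytes), NumElts),
// i.e. the lane count expected in the "ptrue ..., vlN" pattern.
unsigned expectedVL(unsigned VBytes, unsigned EltBytes, unsigned NumElts) {
  assert(EltBytes > 0);
  return std::min(VBytes / EltBytes, NumElts);
}
```

For select_v16f16 with 256-bit vectors, `expectedVL(vbytesFor(256), 2, 16)` is 16, so the check expects `vl16`. For select_v64f32 with the same vector length, `expectedVL(vbytesFor(256), 4, 64)` is 8.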

llvm/test/CodeGen/AArch64/sve-fixed-length-int-select.ll

This file was added.

				; RUN: llc -aarch64-sve-vector-bits-min=128 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=16 -check-prefix=NO_SVE
				; RUN: llc -aarch64-sve-vector-bits-min=256 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK,VBITS_EQ_256
				; RUN: llc -aarch64-sve-vector-bits-min=384 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK
				; RUN: llc -aarch64-sve-vector-bits-min=512 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=640 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=768 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=896 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=1024 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1152 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1280 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1408 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1536 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1664 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1792 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1920 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=2048 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=256 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024,VBITS_GE_2048

				target triple = "aarch64-unknown-linux-gnu"

				; Don't use SVE when its registers are no bigger than NEON.
				; NO_SVE-NOT: ptrue

				; Don't use SVE for 64-bit vectors.
				define <8 x i8> @select_v8i8(<8 x i8> %op1, <8 x i8> %op2, <8 x i1> %mask) #0 {
				; CHECK: select_v8i8:
				; CHECK: bif v0.8b, v1.8b, v2.8b
				; CHECK: ret
				%sel = select <8 x i1> %mask, <8 x i8> %op1, <8 x i8> %op2
				ret <8 x i8> %sel
				}

				; Don't use SVE for 128-bit vectors.
				define <16 x i8> @select_v16i8(<16 x i8> %op1, <16 x i8> %op2, <16 x i1> %mask) #0 {
				; CHECK: select_v16i8:
				; CHECK: bif v0.16b, v1.16b, v2.16b
				; CHECK: ret
				%sel = select <16 x i1> %mask, <16 x i8> %op1, <16 x i8> %op2
				ret <16 x i8> %sel
				}

				define void @select_v32i8(<32 x i8>* %a, <32 x i8>* %b, <32 x i1>* %c) #0 {
				; CHECK: select_v32i8:
				; CHECK: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,32)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].b
				; VBITS_GE_256: ld1b { [[MASK:z[0-9]+]].b }, [[PG]]/z, [x9]
				; VBITS_GE_256-NEXT: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_GE_256-NEXT: and [[AND:z[0-9]+]].b, [[MASK]].b, #0x1
				; VBITS_GE_256-NEXT: cmpne [[COND:p[0-9]+]].b, [[PG1]]/z, [[AND]].b, #0
				; VBITS_GE_256-NEXT: sel [[RES:z[0-9]+]].b, [[COND]], [[OP1]].b, [[OP2]].b
				; VBITS_GE_256-NEXT: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_GE_256: ret
				%mask = load <32 x i1>, <32 x i1>* %c
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%sel = select <32 x i1> %mask, <32 x i8> %op1, <32 x i8> %op2
				store <32 x i8> %sel, <32 x i8>* %a
				ret void
				}

				define void @select_v64i8(<64 x i8>* %a, <64 x i8>* %b, <64 x i1>* %c) #0 {
				; CHECK: select_v64i8:
				; CHECK: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,64)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].b
				; VBITS_GE_512: ld1b { [[MASK:z[0-9]+]].b }, [[PG]]/z, [x9]
				; VBITS_GE_512-NEXT: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_GE_512-NEXT: and [[AND:z[0-9]+]].b, [[MASK]].b, #0x1
				; VBITS_GE_512-NEXT: cmpne [[COND:p[0-9]+]].b, [[PG1]]/z, [[AND]].b, #0
				; VBITS_GE_512-NEXT: sel [[RES:z[0-9]+]].b, [[COND]], [[OP1]].b, [[OP2]].b
				; VBITS_GE_512-NEXT: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_GE_512: ret
				%mask = load <64 x i1>, <64 x i1>* %c
				%op1 = load <64 x i8>, <64 x i8>* %a
				%op2 = load <64 x i8>, <64 x i8>* %b
				%sel = select <64 x i1> %mask, <64 x i8> %op1, <64 x i8> %op2
				store <64 x i8> %sel, <64 x i8>* %a
				ret void
				}

				define void @select_v128i8(<128 x i8>* %a, <128 x i8>* %b, <128 x i1>* %c) #0 {
				; CHECK: select_v128i8:
				; CHECK: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,128)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].b
				; VBITS_GE_1024: ld1b { [[MASK:z[0-9]+]].b }, [[PG]]/z, [x9]
				; VBITS_GE_1024-NEXT: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_GE_1024-NEXT: and [[AND:z[0-9]+]].b, [[MASK]].b, #0x1
				; VBITS_GE_1024-NEXT: cmpne [[COND:p[0-9]+]].b, [[PG1]]/z, [[AND]].b, #0
				; VBITS_GE_1024-NEXT: sel [[RES:z[0-9]+]].b, [[COND]], [[OP1]].b, [[OP2]].b
				; VBITS_GE_1024-NEXT: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_GE_1024: ret
				%mask = load <128 x i1>, <128 x i1>* %c
				%op1 = load <128 x i8>, <128 x i8>* %a
				%op2 = load <128 x i8>, <128 x i8>* %b
				%sel = select <128 x i1> %mask, <128 x i8> %op1, <128 x i8> %op2
				store <128 x i8> %sel, <128 x i8>* %a
				ret void
				}

				define void @select_v256i8(<256 x i8>* %a, <256 x i8>* %b, <256 x i1>* %c) #0 {
				; CHECK: select_v256i8:
				; CHECK: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,256)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].b
				; VBITS_GE_2048: ld1b { [[MASK:z[0-9]+]].b }, [[PG]]/z, [x9]
				; VBITS_GE_2048-NEXT: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_GE_2048-NEXT: and [[AND:z[0-9]+]].b, [[MASK]].b, #0x1
				; VBITS_GE_2048-NEXT: cmpne [[COND:p[0-9]+]].b, [[PG1]]/z, [[AND]].b, #0
				; VBITS_GE_2048-NEXT: sel [[RES:z[0-9]+]].b, [[COND]], [[OP1]].b, [[OP2]].b
				; VBITS_GE_2048-NEXT: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_GE_2048: ret
				%mask = load <256 x i1>, <256 x i1>* %c
				%op1 = load <256 x i8>, <256 x i8>* %a
				%op2 = load <256 x i8>, <256 x i8>* %b
				%sel = select <256 x i1> %mask, <256 x i8> %op1, <256 x i8> %op2
				store <256 x i8> %sel, <256 x i8>* %a
				ret void
				}

				; Don't use SVE for 64-bit vectors.
				define <4 x i16> @select_v4i16(<4 x i16> %op1, <4 x i16> %op2, <4 x i1> %mask) #0 {
				; CHECK: select_v4i16:
				; CHECK: bif v0.8b, v1.8b, v2.8b
				; CHECK: ret
				%sel = select <4 x i1> %mask, <4 x i16> %op1, <4 x i16> %op2
				ret <4 x i16> %sel
				}

				; Don't use SVE for 128-bit vectors.
				define <8 x i16> @select_v8i16(<8 x i16> %op1, <8 x i16> %op2, <8 x i1> %mask) #0 {
				; CHECK: select_v8i16:
				; CHECK: bif v0.16b, v1.16b, v2.16b
				; CHECK: ret
				%sel = select <8 x i1> %mask, <8 x i16> %op1, <8 x i16> %op2
				ret <8 x i16> %sel
				}

				define void @select_v16i16(<16 x i16>* %a, <16 x i16>* %b, <16 x i1>* %c) #0 {
				; CHECK: select_v16i16:
				; CHECK: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),16)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].h
				; VBITS_GE_256: ld1h { [[MASK:z[0-9]+]].h }, [[PG]]/z, [x9]
				; VBITS_GE_256-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_256-NEXT: and [[AND:z[0-9]+]].h, [[MASK]].h, #0x1
				; VBITS_GE_256-NEXT: cmpne [[COND:p[0-9]+]].h, [[PG1]]/z, [[AND]].h, #0
				; VBITS_GE_256-NEXT: sel [[RES:z[0-9]+]].h, [[COND]], [[OP1]].h, [[OP2]].h
				; VBITS_GE_256-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_256: ret
				%mask = load <16 x i1>, <16 x i1>* %c
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%sel = select <16 x i1> %mask, <16 x i16> %op1, <16 x i16> %op2
				store <16 x i16> %sel, <16 x i16>* %a
				ret void
				}

				define void @select_v32i16(<32 x i16>* %a, <32 x i16>* %b, <32 x i1>* %c) #0 {
				; CHECK: select_v32i16:
				; CHECK: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),32)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].h
				; VBITS_GE_512: ld1h { [[MASK:z[0-9]+]].h }, [[PG]]/z, [x9]
				; VBITS_GE_512-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_512-NEXT: and [[AND:z[0-9]+]].h, [[MASK]].h, #0x1
				; VBITS_GE_512-NEXT: cmpne [[COND:p[0-9]+]].h, [[PG1]]/z, [[AND]].h, #0
				; VBITS_GE_512-NEXT: sel [[RES:z[0-9]+]].h, [[COND]], [[OP1]].h, [[OP2]].h
				; VBITS_GE_512-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_512: ret
				%mask = load <32 x i1>, <32 x i1>* %c
				%op1 = load <32 x i16>, <32 x i16>* %a
				%op2 = load <32 x i16>, <32 x i16>* %b
				%sel = select <32 x i1> %mask, <32 x i16> %op1, <32 x i16> %op2
				store <32 x i16> %sel, <32 x i16>* %a
				ret void
				}

				define void @select_v64i16(<64 x i16>* %a, <64 x i16>* %b, <64 x i1>* %c) #0 {
				; CHECK: select_v64i16:
				; CHECK: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),64)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].h
				; VBITS_GE_1024: ld1h { [[MASK:z[0-9]+]].h }, [[PG]]/z, [x9]
				; VBITS_GE_1024-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_1024-NEXT: and [[AND:z[0-9]+]].h, [[MASK]].h, #0x1
				; VBITS_GE_1024-NEXT: cmpne [[COND:p[0-9]+]].h, [[PG1]]/z, [[AND]].h, #0
				; VBITS_GE_1024-NEXT: sel [[RES:z[0-9]+]].h, [[COND]], [[OP1]].h, [[OP2]].h
				; VBITS_GE_1024-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_1024: ret
				%mask = load <64 x i1>, <64 x i1>* %c
				%op1 = load <64 x i16>, <64 x i16>* %a
				%op2 = load <64 x i16>, <64 x i16>* %b
				%sel = select <64 x i1> %mask, <64 x i16> %op1, <64 x i16> %op2
				store <64 x i16> %sel, <64 x i16>* %a
				ret void
				}

				define void @select_v128i16(<128 x i16>* %a, <128 x i16>* %b, <128 x i1>* %c) #0 {
				; CHECK: select_v128i16:
				; CHECK: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),128)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].h
				; VBITS_GE_2048: ld1h { [[MASK:z[0-9]+]].h }, [[PG]]/z, [x9]
				; VBITS_GE_2048-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_2048-NEXT: and [[AND:z[0-9]+]].h, [[MASK]].h, #0x1
				; VBITS_GE_2048-NEXT: cmpne [[COND:p[0-9]+]].h, [[PG1]]/z, [[AND]].h, #0
				; VBITS_GE_2048-NEXT: sel [[RES:z[0-9]+]].h, [[COND]], [[OP1]].h, [[OP2]].h
				; VBITS_GE_2048-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_2048: ret
				%mask = load <128 x i1>, <128 x i1>* %c
				%op1 = load <128 x i16>, <128 x i16>* %a
				%op2 = load <128 x i16>, <128 x i16>* %b
				%sel = select <128 x i1> %mask, <128 x i16> %op1, <128 x i16> %op2
				store <128 x i16> %sel, <128 x i16>* %a
				ret void
				}

				; Don't use SVE for 64-bit vectors.
				define <2 x i32> @select_v2i32(<2 x i32> %op1, <2 x i32> %op2, <2 x i1> %mask) #0 {
				; CHECK: select_v2i32:
				; CHECK: bif v0.8b, v1.8b, v2.8b
				; CHECK: ret
				%sel = select <2 x i1> %mask, <2 x i32> %op1, <2 x i32> %op2
				ret <2 x i32> %sel
				}

				; Don't use SVE for 128-bit vectors.
				define <4 x i32> @select_v4i32(<4 x i32> %op1, <4 x i32> %op2, <4 x i1> %mask) #0 {
				; CHECK: select_v4i32:
				; CHECK: bif v0.16b, v1.16b, v2.16b
				; CHECK: ret
				%sel = select <4 x i1> %mask, <4 x i32> %op1, <4 x i32> %op2
				ret <4 x i32> %sel
				}

				; Use SVE for 256-bit vectors.
				define void @select_v8i32(<8 x i32>* %a, <8 x i32>* %b, <8 x i1>* %c) #0 {
				; CHECK: select_v8i32:
				; CHECK: ptrue [[PG:p[0-9]+]].s, vl[[#min(div(VBYTES,4),8)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].s
				; VBITS_GE_256: ld1w { [[MASK:z[0-9]+]].s }, [[PG]]/z, [x9]
				; VBITS_GE_256-NEXT: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_GE_256-NEXT: and [[AND:z[0-9]+]].s, [[MASK]].s, #0x1
				; VBITS_GE_256-NEXT: cmpne [[COND:p[0-9]+]].s, [[PG1]]/z, [[AND]].s, #0
				; VBITS_GE_256-NEXT: sel [[RES:z[0-9]+]].s, [[COND]], [[OP1]].s, [[OP2]].s
				; VBITS_GE_256-NEXT: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_GE_256: ret
				%mask = load <8 x i1>, <8 x i1>* %c
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%sel = select <8 x i1> %mask, <8 x i32> %op1, <8 x i32> %op2
				store <8 x i32> %sel, <8 x i32>* %a
				ret void
				}

				define void @select_v16i32(<16 x i32>* %a, <16 x i32>* %b, <16 x i1>* %c) #0 {
				; CHECK: select_v16i32:
				; CHECK: ptrue [[PG:p[0-9]+]].s, vl[[#min(div(VBYTES,4),16)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].s
				; VBITS_GE_512: ld1w { [[MASK:z[0-9]+]].s }, [[PG]]/z, [x9]
				; VBITS_GE_512-NEXT: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_GE_512-NEXT: and [[AND:z[0-9]+]].s, [[MASK]].s, #0x1
				; VBITS_GE_512-NEXT: cmpne [[COND:p[0-9]+]].s, [[PG1]]/z, [[AND]].s, #0
				; VBITS_GE_512-NEXT: sel [[RES:z[0-9]+]].s, [[COND]], [[OP1]].s, [[OP2]].s
				; VBITS_GE_512-NEXT: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_GE_512: ret
				%mask = load <16 x i1>, <16 x i1>* %c
				%op1 = load <16 x i32>, <16 x i32>* %a
				%op2 = load <16 x i32>, <16 x i32>* %b
				%sel = select <16 x i1> %mask, <16 x i32> %op1, <16 x i32> %op2
				store <16 x i32> %sel, <16 x i32>* %a
				ret void
				}

				define void @select_v32i32(<32 x i32>* %a, <32 x i32>* %b, <32 x i1>* %c) #0 {
				; CHECK: select_v32i32:
				; CHECK: ptrue [[PG:p[0-9]+]].s, vl[[#min(div(VBYTES,4),32)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].s
				; VBITS_GE_1024: ld1w { [[MASK:z[0-9]+]].s }, [[PG]]/z, [x9]
				; VBITS_GE_1024-NEXT: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_GE_1024-NEXT: and [[AND:z[0-9]+]].s, [[MASK]].s, #0x1
				; VBITS_GE_1024-NEXT: cmpne [[COND:p[0-9]+]].s, [[PG1]]/z, [[AND]].s, #0
				; VBITS_GE_1024-NEXT: sel [[RES:z[0-9]+]].s, [[COND]], [[OP1]].s, [[OP2]].s
				; VBITS_GE_1024-NEXT: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_GE_1024: ret
				%mask = load <32 x i1>, <32 x i1>* %c
				%op1 = load <32 x i32>, <32 x i32>* %a
				%op2 = load <32 x i32>, <32 x i32>* %b
				%sel = select <32 x i1> %mask, <32 x i32> %op1, <32 x i32> %op2
				store <32 x i32> %sel, <32 x i32>* %a
				ret void
				}

				define void @select_v64i32(<64 x i32>* %a, <64 x i32>* %b, <64 x i1>* %c) #0 {
				; CHECK: select_v64i32:
				; CHECK: ptrue [[PG:p[0-9]+]].s, vl[[#min(div(VBYTES,4),64)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].s
				; VBITS_GE_2048: ld1w { [[MASK:z[0-9]+]].s }, [[PG]]/z, [x9]
				; VBITS_GE_2048-NEXT: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_GE_2048-NEXT: and [[AND:z[0-9]+]].s, [[MASK]].s, #0x1
				; VBITS_GE_2048-NEXT: cmpne [[COND:p[0-9]+]].s, [[PG1]]/z, [[AND]].s, #0
				; VBITS_GE_2048-NEXT: sel [[RES:z[0-9]+]].s, [[COND]], [[OP1]].s, [[OP2]].s
				; VBITS_GE_2048-NEXT: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_GE_2048: ret
				%mask = load <64 x i1>, <64 x i1>* %c
				%op1 = load <64 x i32>, <64 x i32>* %a
				%op2 = load <64 x i32>, <64 x i32>* %b
				%sel = select <64 x i1> %mask, <64 x i32> %op1, <64 x i32> %op2
				store <64 x i32> %sel, <64 x i32>* %a
				ret void
				}

				; Don't use SVE for 64-bit vectors.
				define <1 x i64> @select_v1i64(<1 x i64> %op1, <1 x i64> %op2, <1 x i1> %mask) #0 {
				; CHECK: select_v1i64:
				; CHECK: bif v0.8b, v1.8b, v2.8b
				; CHECK: ret
				%sel = select <1 x i1> %mask, <1 x i64> %op1, <1 x i64> %op2
				ret <1 x i64> %sel
				}

				; Don't use SVE for 128-bit vectors.
				define <2 x i64> @select_v2i64(<2 x i64> %op1, <2 x i64> %op2, <2 x i1> %mask) #0 {
				; CHECK: select_v2i64:
				; CHECK: bif v0.16b, v1.16b, v2.16b
				; CHECK: ret
				%sel = select <2 x i1> %mask, <2 x i64> %op1, <2 x i64> %op2
				ret <2 x i64> %sel
				}

				; Use SVE for 256-bit vectors.
				define void @select_v4i64(<4 x i64>* %a, <4 x i64>* %b, <4 x i1>* %c) #0 {
				; CHECK: select_v4i64:
				; CHECK: ptrue [[PG:p[0-9]+]].d, vl[[#min(div(VBYTES,8),4)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].d
				; VBITS_GE_256: ld1d { [[MASK:z[0-9]+]].d }, [[PG]]/z, [x9]
				; VBITS_GE_256-NEXT: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_GE_256-NEXT: and [[AND:z[0-9]+]].d, [[MASK]].d, #0x1
				; VBITS_GE_256-NEXT: cmpne [[COND:p[0-9]+]].d, [[PG1]]/z, [[AND]].d, #0
				; VBITS_GE_256-NEXT: sel [[RES:z[0-9]+]].d, [[COND]], [[OP1]].d, [[OP2]].d
				; VBITS_GE_256-NEXT: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_GE_256: ret
				%mask = load <4 x i1>, <4 x i1>* %c
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%sel = select <4 x i1> %mask, <4 x i64> %op1, <4 x i64> %op2
				store <4 x i64> %sel, <4 x i64>* %a
				ret void
				}

				define void @select_v8i64(<8 x i64>* %a, <8 x i64>* %b, <8 x i1>* %c) #0 {
				; CHECK: select_v8i64:
				; CHECK: ptrue [[PG:p[0-9]+]].d, vl[[#min(div(VBYTES,8),8)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].d
				; VBITS_GE_512: ld1d { [[MASK:z[0-9]+]].d }, [[PG]]/z, [x9]
				; VBITS_GE_512-NEXT: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_GE_512-NEXT: and [[AND:z[0-9]+]].d, [[MASK]].d, #0x1
				; VBITS_GE_512-NEXT: cmpne [[COND:p[0-9]+]].d, [[PG1]]/z, [[AND]].d, #0
				; VBITS_GE_512-NEXT: sel [[RES:z[0-9]+]].d, [[COND]], [[OP1]].d, [[OP2]].d
				; VBITS_GE_512-NEXT: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_GE_512: ret
				%mask = load <8 x i1>, <8 x i1>* %c
				%op1 = load <8 x i64>, <8 x i64>* %a
				%op2 = load <8 x i64>, <8 x i64>* %b
				%sel = select <8 x i1> %mask, <8 x i64> %op1, <8 x i64> %op2
				store <8 x i64> %sel, <8 x i64>* %a
				ret void
				}

				define void @select_v16i64(<16 x i64>* %a, <16 x i64>* %b, <16 x i1>* %c) #0 {
				; CHECK: select_v16i64:
				; CHECK: ptrue [[PG:p[0-9]+]].d, vl[[#min(div(VBYTES,8),16)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].d
				; VBITS_GE_1024: ld1d { [[MASK:z[0-9]+]].d }, [[PG]]/z, [x9]
				; VBITS_GE_1024-NEXT: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_GE_1024-NEXT: and [[AND:z[0-9]+]].d, [[MASK]].d, #0x1
				; VBITS_GE_1024-NEXT: cmpne [[COND:p[0-9]+]].d, [[PG1]]/z, [[AND]].d, #0
				; VBITS_GE_1024-NEXT: sel [[RES:z[0-9]+]].d, [[COND]], [[OP1]].d, [[OP2]].d
				; VBITS_GE_1024-NEXT: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_GE_1024: ret
				%mask = load <16 x i1>, <16 x i1>* %c
				%op1 = load <16 x i64>, <16 x i64>* %a
				%op2 = load <16 x i64>, <16 x i64>* %b
				%sel = select <16 x i1> %mask, <16 x i64> %op1, <16 x i64> %op2
				store <16 x i64> %sel, <16 x i64>* %a
				ret void
				}

				define void @select_v32i64(<32 x i64>* %a, <32 x i64>* %b, <32 x i1>* %c) #0 {
				; CHECK: select_v32i64:
				; CHECK: ptrue [[PG:p[0-9]+]].d, vl[[#min(div(VBYTES,8),32)]]
				; CHECK: ptrue [[PG1:p[0-9]+]].d
				; VBITS_GE_2048: ld1d { [[MASK:z[0-9]+]].d }, [[PG]]/z, [x9]
				; VBITS_GE_2048-NEXT: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_GE_2048-NEXT: and [[AND:z[0-9]+]].d, [[MASK]].d, #0x1
				; VBITS_GE_2048-NEXT: cmpne [[COND:p[0-9]+]].d, [[PG1]]/z, [[AND]].d, #0
				; VBITS_GE_2048-NEXT: sel [[RES:z[0-9]+]].d, [[COND]], [[OP1]].d, [[OP2]].d
				; VBITS_GE_2048-NEXT: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_GE_2048: ret
				%mask = load <32 x i1>, <32 x i1>* %c
				%op1 = load <32 x i64>, <32 x i64>* %a
				%op2 = load <32 x i64>, <32 x i64>* %b
				%sel = select <32 x i1> %mask, <32 x i64> %op1, <32 x i64> %op2
				store <32 x i64> %sel, <32 x i64>* %a
				ret void
				}

				attributes #0 = { "target-features"="+sve" }