This is an archive of the discontinued LLVM Phabricator instance.

Differential D148316

[AArch64] Add support for efficient bitcast in vector truncate store.
Closed · Public

Authored by lawben on Apr 14 2023, 2:27 AM.

Details

Reviewers

dmgreen
efriedma
jaykang10
SjoerdMeijer
Sp00ph

Commits

rGcd68e17bc2f9: [AArch64] Add support for efficient bitcast in vector truncate store.

Summary

Following the changes in D145301, we now also support the efficient bitcast when storing the bool vector. Previously, this was expanded.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

lawben created this revision. Apr 14 2023, 2:27 AM

Herald added a project: Restricted Project. Apr 14 2023, 2:27 AM

Herald added a subscriber: hiraditya.

lawben requested review of this revision. Apr 14 2023, 2:27 AM

Herald added a project: Restricted Project. Apr 14 2023, 2:27 AM

Herald added a subscriber: llvm-commits.

lawben retitled this revision from "Add support for efficient bitcast in vector truncate store." to "[AArch64] Add support for efficient bitcast in vector truncate store." Apr 14 2023, 2:31 AM

Herald added a subscriber: kristof.beyls. Apr 14 2023, 2:31 AM

Harbormaster completed remote builds in B225545: Diff 513501. Apr 14 2023, 3:25 AM

dmgreen added inline comments. Apr 18 2023, 12:34 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
Line 1223: v8i1? It doesn't appear that these are necessary though, as the trunc stores are already created and the new code goes through a combine. You could alternatively add the call to combineVectorCompareAndTruncateStore to lowerSTORE instead of the combine.
llvm/test/CodeGen/AArch64/vec_uaddo.ll
Line 1: Remove this?
llvm/test/CodeGen/AArch64/vec_umulo.ll
Line 1: Remove this?

dmgreen added reviewers: efriedma, jaykang10, SjoerdMeijer. Apr 18 2023, 12:35 AM

dmgreen added a reviewer: Sp00ph.

lawben added inline comments. Apr 18 2023, 3:23 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
Line 1223: Regarding moving it to `lowerSTORE`: I wanted to check before I go down that path and change the code. I originally had it in there and then moved it to the "combine" pass, because the truncating store is legalized _after_ the `setcc`. So we have lots of compare logic with intermediate `XOR` etc., which prevents the other passes from fully optimizing it out again. We are essentially calling truncate, then sign extend. The sign extend is removed if it knows that `numSignBits` is equal to the element's size, but we lose this information after legalization. If I apply this logic before legalization, we don't run into that issue. I'm happy to move the code and figure out how to get the optimizations to apply again if `lowerSTORE` is the correct location. If this was just a suggestion, I think leaving it in the combine pass is better. I just wanted to get your opinion before implementing it.

Does having sign bits for the vector compare nodes help, as in https://reviews.llvm.org/D148624? I wasn't able to find a good way to test that, the optimizer would always clear away the results for all the examples I tried prior to creating the compare nodes.

In D148316#4277254, @dmgreen wrote:

Does having sign bits for the vector compare nodes help, as in https://reviews.llvm.org/D148624? I wasn't able to find a good way to test that, the optimizer would always clear away the results for all the examples I tried prior to creating the compare nodes.

Unfortunately, it does not. I'll try to dig into this tomorrow when I have some time. I'm not familiar with how the sign extend is removed, but if we have the new method in lowerSTORE, there is an xor with a 0-vector, which I'm not sure passes the known sign bits check. I'll let you know once I have some more insights from digging into it. Maybe I can get this to work.

I dug into this a bit, and your vector compare sign bit patch does the right thing here. But the problem is unrelated and a bit annoying. At the time the sign_extend_inreg that we added is combined (to potentially remove it), there is a build_vector with (1 << vectorElementSize) - 1 bits (to negate via xor). But ComputeNumSignBits breaks here. The vector has 32-bit constants but we only have 8-bit vectors. So it detects 24 sign bits and then kinda gives up. So without changing code in SelectionDAG::ComputeNumSignBits or the negation of vector entries, there is no way to correctly determine the sign bits.

If you have any ideas how to overcome this, please let me know. Otherwise, I'd leave this in performSTORECombine to apply it before legalization (vec != 0 --> (vec == 0) ^ 255 ) breaks it.
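
The width mismatch described above can be reproduced in isolation. The sketch below is standalone C++ and does not call into LLVM: `numSignBits` is a hypothetical helper that only mirrors what a sign-bit count means (the number of identical leading bits, sign bit included), and it is not the implementation of `SelectionDAG::ComputeNumSignBits`. Under that assumption, it shows why the all-ones constant 255 yields 24 sign bits when read as a 32-bit value, but 8 sign bits when read as the 8-bit lane it is meant for.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an LLVM API): counts how many of the top bits of
// `value`, viewed as a `width`-bit integer, are copies of the sign bit.
// The sign bit itself is included in the count.
unsigned numSignBits(uint64_t value, unsigned width) {
  assert(width >= 1 && width <= 64 && "width must fit in 64 bits");
  const uint64_t signBit = (value >> (width - 1)) & 1u;
  unsigned count = 0;
  for (int bit = static_cast<int>(width) - 1; bit >= 0; --bit) {
    if (((value >> bit) & 1u) != signBit)
      break; // first bit that differs from the sign bit ends the run
    ++count;
  }
  return count;
}
```

Here `numSignBits(0xFF, 32)` evaluates to 24, because bits 31 down to 8 are zero and bit 7 differs, while `numSignBits(0xFF, 8)` evaluates to 8, because every bit of the lane equals the sign bit. Only the second reading satisfies the "sign bits equal to the element size" condition mentioned earlier in the thread.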

sgraenitz added a subscriber: sgraenitz. Apr 20 2023, 12:34 AM

In D148316#4280756, @lawben wrote:

I dug into this a bit, and your vector compare sign bit patch does the right thing here. But the problem is unrelated and a bit annoying. At the time the sign_extend_inreg that we added is combined (to potentially remove it), there is a build_vector with (1 << vectorElementSize) - 1 bits (to negate via xor). But ComputeNumSignBits breaks here. The vector has 32-bit constants but we only have 8-bit vectors. So it detects 24 sign bits and then kinda gives up. So without changing code in SelectionDAG::ComputeNumSignBits or the negation of vector entries, there is no way to correctly determine the sign bits.

If you have any ideas how to overcome this, please let me know. Otherwise, I'd leave this in performSTORECombine to apply it before legalization (vec != 0 --> (vec == 0) ^ 255 ) breaks it.

Yeah, that sounds OK. There might be a way to fix it by teaching something to truncate constants properly, but many of those things can lead to more and more issues that need to be accounted for. Having this as a combine sounds like it should be fine; it just needs to be more careful about illegal types, and it looks like those should be accounted for.

I think it can remove the setTruncStoreAction's if it goes through a combine?

lawben mentioned this in D145301: Add more efficient vector bitcast for AArch64. Apr 22 2023, 4:05 AM

Address a few review comments.

@dmgreen If you are concerned about legal types, should I add a check to only perform this before legalization?

Harbormaster completed remote builds in B227444: Diff 516057. Apr 22 2023, 6:12 AM

@dmgreen If you are concerned about legal types, should I add a check to only perform this before legalization?

It would be the other way around, performing the transform after legalization so that we know all types have been legalized. But that runs into the same issues as doing it during lowering. It is worth making sure there are tests for illegal types with odd number vector lanes and non-power2 integer types.

llvm/test/CodeGen/AArch64/vec-combine-compare-truncate-store.ll
Line 228: Some of these with low vector lanes are starting to look worse than the code before. The fmov/strb could be done on the fp side, but I don't think that would be enough to make them profitable. Is it worth limiting it to >= 4 vector lanes?

Added check for illegal types. I could not get this check to actually fire. The type is legalized before the bitcast in the bitcast lowering case, and it is legalized before the store becomes a truncating store. So in both cases other checks prevent bad things from happening. But it is probably still fine to keep it as a defensive check in case things change.

lawben added inline comments. Apr 26 2023, 2:47 AM

llvm/test/CodeGen/AArch64/vec-combine-compare-truncate-store.ll
Line 228: On my M1, it is still faster though. But I agree that this needs a bit of investigation on other ARM CPUs. Suggestion: In a follow-up patch, with the changes to `v16i8` suggested by @efriedma, I'll run a set of benchmarks for the `v16i8` and the `v2i64` (and other) cases on my M1, Graviton 2, Graviton 3, and a Pi 4 (and possibly an A64FX, but that has terrible NEON performance across the board). I have this set up for another project at the moment anyway. Then this should give us a bit of a wider range of performance characteristics. So I'd suggest leaving this as is in this patch and then doing a performance-based follow-up patch. Thoughts?

Harbormaster completed remote builds in B228253: Diff 517109. Apr 26 2023, 3:24 AM

Thanks for checking about the performance. LGTM in that case.

llvm/test/CodeGen/AArch64/vec-combine-compare-truncate-store.ll
Line 228: Sure - I was just going from the number of instructions and the extra constant pools. The critical path looks longer (but that might not matter much), and there are fewer FPR->GPR transfers, which will help.

This revision is now accepted and ready to land. Apr 26 2023, 8:59 AM

@dmgreen Thanks for your review. Could you please merge this with "Lawrence Benson <github@lawben.com>".

Oh yes, sorry I keep forgetting. If you have more patches you intended to submit then you could probably ask for commit access. I will commit this one now. Thanks for working on the patch.

Closed by commit rGcd68e17bc2f9: [AArch64] Add support for efficient bitcast in vector truncate store. (authored by lawben, committed by dmgreen). Apr 28 2023, 3:19 AM

This revision was automatically updated to reflect the committed changes.

dmgreen added a commit: rGcd68e17bc2f9: [AArch64] Add support for efficient bitcast in vector truncate store..

lawben mentioned this in D148624: [AArch64] Add sign bits handling for vector compare nodes. May 2 2023, 2:05 AM

Revision Contents

Path                                                                 Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp                      55 lines
llvm/test/CodeGen/AArch64/setcc-type-mismatch.ll                     15 lines
llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll          60 lines
llvm/test/CodeGen/AArch64/vec-combine-compare-truncate-store.ll      281 lines
llvm/test/CodeGen/AArch64/vec_uaddo.ll                               20 lines
llvm/test/CodeGen/AArch64/vec_umulo.ll                               19 lines

Diff 517855

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[...]
   if (Subtarget->hasNEON()) {
     setOperationAction(ISD::BITCAST, MVT::i4, Custom);
     setOperationAction(ISD::BITCAST, MVT::i8, Custom);
     setOperationAction(ISD::BITCAST, MVT::i16, Custom);

     setLoadExtAction(ISD::EXTLOAD, MVT::v4i16, MVT::v4i8, Custom);
     setLoadExtAction(ISD::SEXTLOAD, MVT::v4i16, MVT::v4i8, Custom);
     setLoadExtAction(ISD::ZEXTLOAD, MVT::v4i16, MVT::v4i8, Custom);
     setLoadExtAction(ISD::EXTLOAD, MVT::v4i32, MVT::v4i8, Custom);
     setLoadExtAction(ISD::SEXTLOAD, MVT::v4i32, MVT::v4i8, Custom);
     setLoadExtAction(ISD::ZEXTLOAD, MVT::v4i32, MVT::v4i8, Custom);

     // ADDP custom lowering
     for (MVT VT : { MVT::v32i8, MVT::v16i16, MVT::v8i32, MVT::v4i64 })
       setOperationAction(ISD::ADD, VT, Custom);
     // FADDP custom lowering
     for (MVT VT : { MVT::v16f16, MVT::v8f32, MVT::v4f64 })
       setOperationAction(ISD::FADD, VT, Custom);
[...]

 // When converting a <N x iX> vector to <N x i1> to store or use as a scalar
 // iN, we can use a trick that extracts the i^th bit from the i^th element and
 // then performs a vector add to get a scalar bitmask. This requires that each
 // element's bits are either all 1 or all 0.
 static SDValue vectorToScalarBitmask(SDNode *N, SelectionDAG &DAG) {
   SDLoc DL(N);
   SDValue ComparisonResult(N, 0);
-  EVT BoolVecVT = ComparisonResult.getValueType();
-  assert(BoolVecVT.isVector() && "Must be a vector type");
+  EVT VecVT = ComparisonResult.getValueType();
+  assert(VecVT.isVector() && "Must be a vector type");

-  unsigned NumElts = BoolVecVT.getVectorNumElements();
+  unsigned NumElts = VecVT.getVectorNumElements();
   if (NumElts != 2 && NumElts != 4 && NumElts != 8 && NumElts != 16)
     return SDValue();

+  if (VecVT.getVectorElementType() != MVT::i1 &&
+      !DAG.getTargetLoweringInfo().isTypeLegal(VecVT))
+    return SDValue();
+
   // If we can find the original types to work on instead of a vector of i1,
   // we can avoid extend/extract conversion instructions.
-  EVT VecVT = tryGetOriginalBoolVectorType(ComparisonResult);
-  if (!VecVT.isSimple()) {
-    unsigned BitsPerElement = std::max(64 / NumElts, 8u); // min. 64-bit vector
-    VecVT =
-        BoolVecVT.changeVectorElementType(MVT::getIntegerVT(BitsPerElement));
-  }
+  if (VecVT.getVectorElementType() == MVT::i1) {
+    VecVT = tryGetOriginalBoolVectorType(ComparisonResult);
+    if (!VecVT.isSimple()) {
+      unsigned BitsPerElement = std::max(64 / NumElts, 8u); // >= 64-bit vector
+      VecVT = MVT::getVectorVT(MVT::getIntegerVT(BitsPerElement), NumElts);
+    }
+  }
   VecVT = VecVT.changeVectorElementTypeToInteger();

   // Large vectors don't map directly to this conversion, so to avoid too many
   // edge cases, we don't apply it here. The conversion will likely still be
   // applied later via multiple smaller vectors, whose results are concatenated.
   if (VecVT.getSizeInBits() > 128)
     return SDValue();
[...]
   SDValue Mask = DAG.getNode(ISD::BUILD_VECTOR, DL, VecVT, MaskConstants);
   SDValue RepresentativeBits =
       DAG.getNode(ISD::AND, DL, VecVT, ComparisonResult, Mask);
   EVT ResultVT = MVT::getIntegerVT(std::max<unsigned>(
       NumElts, VecVT.getVectorElementType().getSizeInBits()));
   return DAG.getNode(ISD::VECREDUCE_ADD, DL, ResultVT, RepresentativeBits);
 }

+static SDValue combineBoolVectorAndTruncateStore(SelectionDAG &DAG,
+                                                 StoreSDNode *Store) {
+  if (!Store->isTruncatingStore())
+    return SDValue();
+
+  SDLoc DL(Store);
+  SDValue VecOp = Store->getValue();
+  EVT VT = VecOp.getValueType();
+  EVT MemVT = Store->getMemoryVT();
+
+  if (!MemVT.isVector() || !VT.isVector() ||
+      MemVT.getVectorElementType() != MVT::i1)
+    return SDValue();
+
+  // If we are storing a vector that we are currently building, let
+  // `scalarizeVectorStore()` handle this more efficiently.
+  if (VecOp.getOpcode() == ISD::BUILD_VECTOR)
+    return SDValue();
+
+  VecOp = DAG.getNode(ISD::TRUNCATE, DL, MemVT, VecOp);
+  SDValue VectorBits = vectorToScalarBitmask(VecOp.getNode(), DAG);
+  if (!VectorBits)
+    return SDValue();
+
+  EVT StoreVT =
+      EVT::getIntegerVT(*DAG.getContext(), MemVT.getStoreSizeInBits());
+  SDValue ExtendedBits = DAG.getZExtOrTrunc(VectorBits, DL, StoreVT);
+  return DAG.getStore(Store->getChain(), DL, ExtendedBits, Store->getBasePtr(),
+                      Store->getMemOperand());
+}

 static SDValue performSTORECombine(SDNode *N,
                                    TargetLowering::DAGCombinerInfo &DCI,
                                    SelectionDAG &DAG,
                                    const AArch64Subtarget *Subtarget) {
   StoreSDNode *ST = cast<StoreSDNode>(N);
   SDValue Chain = ST->getChain();
   SDValue Value = ST->getValue();
   SDValue Ptr = ST->getBasePtr();
[...]

   if (Subtarget->supportsAddressTopByteIgnored() &&
       performTBISimplification(N->getOperand(2), DCI, DAG))
     return SDValue(N, 0);

   if (SDValue Store = foldTruncStoreOfExt(DAG, N))
     return Store;

+  if (SDValue Store = combineBoolVectorAndTruncateStore(DAG, ST))
+    return Store;
+
   return SDValue();
 }

 static SDValue performMSTORECombine(SDNode *N,
                                     TargetLowering::DAGCombinerInfo &DCI,
                                     SelectionDAG &DAG,
                                     const AArch64Subtarget *Subtarget) {
   MaskedStoreSDNode *MST = cast<MaskedStoreSDNode>(N);
[...]
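
The trick described in the comment above `vectorToScalarBitmask` (AND each all-ones or all-zeros lane with a per-lane power of two, then add the lanes horizontally) can be modelled in plain scalar C++. This is only an illustrative sketch, not the SelectionDAG code: `lanesToBitmask` and `sixteenLanesToBitmask` are hypothetical names, lanes are assumed to be 8 bits wide, and lane `i` is assumed to map to bit `i` of the result. The 16-lane variant mirrors the shape of the `store_16_elements` test output, where the two 8-lane halves are reduced separately and combined with a shift by 8.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of the bitmask trick for up to 8 lanes of 8 bits each.
// Every lane must be all ones (true) or all zeros (false).
template <std::size_t N>
uint32_t lanesToBitmask(const std::array<uint8_t, N> &lanes) {
  static_assert(N == 2 || N == 4 || N == 8, "sketch covers 2, 4 or 8 lanes");
  uint32_t sum = 0;
  for (std::size_t i = 0; i < N; ++i) {
    assert((lanes[i] == 0x00 || lanes[i] == 0xFF) && "lane must be a bool mask");
    // Mask constants 1, 2, 4, ...: lane i keeps only its i-th bit.
    const uint8_t maskConstant = static_cast<uint8_t>(1u << i);
    // AND with the mask, then accumulate (the horizontal vector add).
    sum += static_cast<uint32_t>(lanes[i] & maskConstant);
  }
  return sum;
}

// 16 lanes: reduce the low and high halves separately, then combine them
// as low | (high << 8).
uint32_t sixteenLanesToBitmask(const std::array<uint8_t, 16> &lanes) {
  std::array<uint8_t, 8> low{};
  std::array<uint8_t, 8> high{};
  for (std::size_t i = 0; i < 8; ++i) {
    low[i] = lanes[i];
    high[i] = lanes[i + 8];
  }
  return lanesToBitmask(low) | (lanesToBitmask(high) << 8);
}
```

Because each masked lane contributes a distinct power of two, the sum cannot carry between bits, which is why an add works as a bitwise OR here. For example, the four lanes `{0xFF, 0x00, 0xFF, 0xFF}` produce 1 + 4 + 8 = 13.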

llvm/test/CodeGen/AArch64/setcc-type-mismatch.ll

+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -mtriple=aarch64-linux-gnu %s -o - | FileCheck %s

 define void @test_mismatched_setcc(<4 x i22> %l, <4 x i22> %r, ptr %addr) {
 ; CHECK-LABEL: test_mismatched_setcc:
-; CHECK: cmeq [[CMP128:v[0-9]+]].4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.4s
-; CHECK: xtn {{v[0-9]+}}.4h, [[CMP128]].4s
+; CHECK: // %bb.0:
+; CHECK-NEXT: movi v2.4s, #63, msl #16
+; CHECK-NEXT: adrp x8, .LCPI0_0
+; CHECK-NEXT: ldr q3, [x8, :lo12:.LCPI0_0]
+; CHECK-NEXT: and v1.16b, v1.16b, v2.16b
+; CHECK-NEXT: and v0.16b, v0.16b, v2.16b
+; CHECK-NEXT: cmeq v0.4s, v0.4s, v1.4s
+; CHECK-NEXT: and v0.16b, v0.16b, v3.16b
+; CHECK-NEXT: addv s0, v0.4s
+; CHECK-NEXT: fmov w8, s0
+; CHECK-NEXT: strb w8, [x0]
+; CHECK-NEXT: ret

   %tst = icmp eq <4 x i22> %l, %r
   store <4 x i1> %tst, ptr %addr
   ret void
 }

llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll

[...]
 ; CHECK-NEXT: fmov w0, s0
 ; CHECK-NEXT: ret

   %cmp_result = fcmp one <4 x float> %vec, zeroinitializer
   %bitmask = bitcast <4 x i1> %cmp_result to i4
   ret i4 %bitmask
 }

-; TODO(lawben): Change this in follow-up patch to #D145301, as truncating stores fix this.
-; Larger vector types don't map directly.
-define i8 @no_convert_large_vector(<8 x i32> %vec) {
+; Larger vector types don't map directly, but the can be split/truncated and then converted.
+; After the comparison against 0, this is truncated to <8 x i16>, which is valid again.
+define i8 @convert_large_vector(<8 x i32> %vec) {
+; CHECK-LABEL: lCPI15_0:
+; CHECK-NEXT: .short 1
+; CHECK-NEXT: .short 2
+; CHECK-NEXT: .short 4
+; CHECK-NEXT: .short 8
+; CHECK-NEXT: .short 16
+; CHECK-NEXT: .short 32
+; CHECK-NEXT: .short 64
+; CHECK-NEXT: .short 128
+
 ; CHECK-LABEL: convert_large_vector:
-; CHECK: cmeq.4s v1, v1, #0
-; CHECK-NOT: addv
+; CHECK: Lloh30:
+; CHECK-NEXT: adrp x8, lCPI15_0@PAGE
+; CHECK-NEXT: cmeq.4s v1, v1, #0
+; CHECK-NEXT: cmeq.4s v0, v0, #0
+; CHECK-NEXT: uzp1.8h v0, v0, v1
+; CHECK-NEXT: Lloh31:
+; CHECK-NEXT: ldr q1, [x8, lCPI15_0@PAGEOFF]
+; CHECK-NEXT: bic.16b v0, v1, v0
+; CHECK-NEXT: addv.8h h0, v0
+; CHECK-NEXT: fmov w8, s0
+; CHECK-NEXT: and w0, w8, #0xff
+; CHECK-NEXT: add sp, sp, #16
+; CHECK-NEXT: ret

   %cmp_result = icmp ne <8 x i32> %vec, zeroinitializer
   %bitmask = bitcast <8 x i1> %cmp_result to i8
   ret i8 %bitmask
 }

+define i4 @convert_legalized_illegal_element_size(<4 x i22> %vec) {
+; CHECK-LABEL: convert_legalized_illegal_element_size
+; CHECK: ; %bb.0:
+; CHECK-NEXT: movi.4s v1, #63, msl #16
+; CHECK-NEXT: Lloh32:
+; CHECK-NEXT: adrp x8, lCPI16_0@PAGE
+; CHECK-NEXT: cmtst.4s v0, v0, v1
+; CHECK-NEXT: Lloh33:
+; CHECK-NEXT: ldr d1, [x8, lCPI16_0@PAGEOFF]
+; CHECK-NEXT: xtn.4h v0, v0
+; CHECK-NEXT: and.8b v0, v0, v1
+; CHECK-NEXT: addv.4h h0, v0
+; CHECK-NEXT: fmov w0, s0
+; CHECK-NEXT: ret
+
+  %cmp_result = icmp ne <4 x i22> %vec, zeroinitializer
+  %bitmask = bitcast <4 x i1> %cmp_result to i4
+  ret i4 %bitmask
+}
+
 ; This may still be converted as a v8i8 after the vector concat (but not as v4iX).
 define i8 @no_direct_convert_for_bad_concat(<4 x i32> %vec) {
 ; CHECK-LABEL: no_direct_convert_for_bad_concat:
 ; CHECK: cmtst.4s v0, v0, v0
 ; CHECK-NOT: addv.4

   %cmp_result = icmp ne <4 x i32> %vec, zeroinitializer
   %vector_pad = shufflevector <4 x i1> poison, <4 x i1> %cmp_result, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 4, i32 5, i32 6, i32 7>
   %bitmask = bitcast <8 x i1> %vector_pad to i8
   ret i8 %bitmask
 }

 define <8 x i1> @no_convert_without_direct_bitcast(<8 x i16> %vec) {
 ; CHECK-LABEL: no_convert_without_direct_bitcast:
 ; CHECK: cmtst.8h v0, v0, v0
 ; CHECK-NOT: addv.4s s0, v0

   %cmp_result = icmp ne <8 x i16> %vec, zeroinitializer
   ret <8 x i1> %cmp_result
 }

+define i6 @no_combine_illegal_num_elements(<6 x i32> %vec) {
+; CHECK-LABEL: no_combine_illegal_num_elements
+; CHECK-NOT: addv
+
+  %cmp_result = icmp ne <6 x i32> %vec, zeroinitializer
+  %bitmask = bitcast <6 x i1> %cmp_result to i6
+  ret i6 %bitmask
+}

llvm/test/CodeGen/AArch64/vec-combine-compare-truncate-store.ll

This file was added.

				; RUN: llc -mtriple=aarch64-apple-darwin -mattr=+neon -verify-machineinstrs < %s | FileCheck %s

				define void @store_16_elements(<16 x i8> %vec, ptr %out) {
				; Bits used in mask
				; CHECK-LABEL: lCPI0_0
				; CHECK-NEXT: .byte 1
				; CHECK-NEXT: .byte 2
				; CHECK-NEXT: .byte 4
				; CHECK-NEXT: .byte 8
				; CHECK-NEXT: .byte 16
				; CHECK-NEXT: .byte 32
				; CHECK-NEXT: .byte 64
				; CHECK-NEXT: .byte 128
				; CHECK-NEXT: .byte 1
				; CHECK-NEXT: .byte 2
				; CHECK-NEXT: .byte 4
				; CHECK-NEXT: .byte 8
				; CHECK-NEXT: .byte 16
				; CHECK-NEXT: .byte 32
				; CHECK-NEXT: .byte 64
				; CHECK-NEXT: .byte 128

				; Actual conversion
				; CHECK-LABEL: store_16_elements
				; CHECK: ; %bb.0:
				; CHECK-NEXT: Lloh0:
				; CHECK-NEXT: adrp x8, lCPI0_0@PAGE
				; CHECK-NEXT: cmeq.16b v0, v0, #0
				; CHECK-NEXT: Lloh1:
				; CHECK-NEXT: ldr q1, [x8, lCPI0_0@PAGEOFF]
				; CHECK-NEXT: bic.16b v0, v1, v0
				; CHECK-NEXT: ext.16b v1, v0, v0, #8
				; CHECK-NEXT: addv.8b b0, v0
				; CHECK-NEXT: addv.8b b1, v1
				; CHECK-NEXT: fmov w9, s0
				; CHECK-NEXT: fmov w8, s1
				; CHECK-NEXT: orr w8, w9, w8, lsl #8
				; CHECK-NEXT: strh w8, [x0]
				; CHECK-NEXT: ret

				%cmp_result = icmp ne <16 x i8> %vec, zeroinitializer
				store <16 x i1> %cmp_result, ptr %out
				ret void
				}

				define void @store_8_elements(<8 x i16> %vec, ptr %out) {
				; CHECK-LABEL: lCPI1_0:
				; CHECK-NEXT: .short 1
				; CHECK-NEXT: .short 2
				; CHECK-NEXT: .short 4
				; CHECK-NEXT: .short 8
				; CHECK-NEXT: .short 16
				; CHECK-NEXT: .short 32
				; CHECK-NEXT: .short 64
				; CHECK-NEXT: .short 128

				; CHECK-LABEL: store_8_elements
				; CHECK: ; %bb.0:
				; CHECK-NEXT: Lloh2:
				; CHECK-NEXT: adrp x8, lCPI1_0@PAGE
				; CHECK-NEXT: cmeq.8h v0, v0, #0
				; CHECK-NEXT: Lloh3:
				; CHECK-NEXT: ldr q1, [x8, lCPI1_0@PAGEOFF]
				; CHECK-NEXT: bic.16b v0, v1, v0
				; CHECK-NEXT: addv.8h h0, v0
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: strb w8, [x0]
				; CHECK-NEXT: ret

				%cmp_result = icmp ne <8 x i16> %vec, zeroinitializer
				store <8 x i1> %cmp_result, ptr %out
				ret void
				}

				define void @store_4_elements(<4 x i32> %vec, ptr %out) {
				; CHECK-LABEL: lCPI2_0:
				; CHECK-NEXT: .long 1
				; CHECK-NEXT: .long 2
				; CHECK-NEXT: .long 4
				; CHECK-NEXT: .long 8

				; CHECK-LABEL: store_4_elements
				; CHECK: ; %bb.0:
				; CHECK-NEXT: Lloh4:
				; CHECK-NEXT: adrp x8, lCPI2_0@PAGE
				; CHECK-NEXT: cmeq.4s v0, v0, #0
				; CHECK-NEXT: Lloh5:
				; CHECK-NEXT: ldr q1, [x8, lCPI2_0@PAGEOFF]
				; CHECK-NEXT: bic.16b v0, v1, v0
				; CHECK-NEXT: addv.4s s0, v0
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: strb w8, [x0]
				; CHECK-NEXT: ret

				%cmp_result = icmp ne <4 x i32> %vec, zeroinitializer
				store <4 x i1> %cmp_result, ptr %out
				ret void
				}

				define void @store_2_elements(<2 x i64> %vec, ptr %out) {
				; CHECK-LABEL: lCPI3_0:
				; CHECK-NEXT: .quad 1
				; CHECK-NEXT: .quad 2

				; CHECK-LABEL: store_2_elements
				; CHECK: ; %bb.0:
				; CHECK-NEXT: Lloh6:
				; CHECK-NEXT: adrp x8, lCPI3_0@PAGE
				; CHECK-NEXT: cmeq.2d v0, v0, #0
				; CHECK-NEXT: Lloh7:
				; CHECK-NEXT: ldr q1, [x8, lCPI3_0@PAGEOFF]
				; CHECK-NEXT: bic.16b v0, v1, v0
				; CHECK-NEXT: addp.2d d0, v0
				; CHECK-NEXT: fmov x8, d0
				; CHECK-NEXT: strb w8, [x0]
				; CHECK-NEXT: ret

				%cmp_result = icmp ne <2 x i64> %vec, zeroinitializer
				store <2 x i1> %cmp_result, ptr %out
				ret void
				}

				define void @add_trunc_compare_before_store(<4 x i32> %vec, ptr %out) {
				; CHECK-LABEL: lCPI4_0:
				; CHECK-NEXT: .long 1
				; CHECK-NEXT: .long 2
				; CHECK-NEXT: .long 4
				; CHECK-NEXT: .long 8

				; CHECK-LABEL: add_trunc_compare_before_store
				; CHECK: ; %bb.0:
				; CHECK-NEXT: Lloh8:
				; CHECK-NEXT: adrp x8, lCPI4_0@PAGE
				; CHECK-NEXT: shl.4s v0, v0, #31
				; CHECK-NEXT: cmlt.4s v0, v0, #0
				; CHECK-NEXT: Lloh9:
				; CHECK-NEXT: ldr q1, [x8, lCPI4_0@PAGEOFF]
				; CHECK-NEXT: and.16b v0, v0, v1
				; CHECK-NEXT: addv.4s s0, v0
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: strb w8, [x0]
				; CHECK-NEXT: ret

				%trunc = trunc <4 x i32> %vec to <4 x i1>
				store <4 x i1> %trunc, ptr %out
				ret void
				}

				define void @add_trunc_mask_unknown_vector_type(<4 x i1> %vec, ptr %out) {
				; CHECK-LABEL: lCPI5_0:
				; CHECK: .short 1
				; CHECK: .short 2
				; CHECK: .short 4
				; CHECK: .short 8

				; CHECK-LABEL: add_trunc_mask_unknown_vector_type
				; CHECK: ; %bb.0:
				; CHECK-NEXT: Lloh10:
				; CHECK-NEXT: adrp x8, lCPI5_0@PAGE
				; CHECK-NEXT: shl.4h v0, v0, #15
				; CHECK-NEXT: cmlt.4h v0, v0, #0
				; CHECK-NEXT: Lloh11:
				; CHECK-NEXT: ldr d1, [x8, lCPI5_0@PAGEOFF]
				; CHECK-NEXT: and.8b v0, v0, v1
				; CHECK-NEXT: addv.4h h0, v0
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: strb w8, [x0]
				; CHECK-NEXT: ret

				store <4 x i1> %vec, ptr %out
				ret void
				}

				define void @store_8_elements_64_bit_vector(<8 x i8> %vec, ptr %out) {
				; CHECK-LABEL: lCPI6_0:
				; CHECK-NEXT: .byte 1
				; CHECK-NEXT: .byte 2
				; CHECK-NEXT: .byte 4
				; CHECK-NEXT: .byte 8
				; CHECK-NEXT: .byte 16
				; CHECK-NEXT: .byte 32
				; CHECK-NEXT: .byte 64
				; CHECK-NEXT: .byte 128

				; CHECK-LABEL: store_8_elements_64_bit_vector
				; CHECK: ; %bb.0:
				; CHECK-NEXT: Lloh12:
				; CHECK-NEXT: adrp x8, lCPI6_0@PAGE
				; CHECK-NEXT: cmeq.8b v0, v0, #0
				; CHECK-NEXT: Lloh13:
				; CHECK-NEXT: ldr d1, [x8, lCPI6_0@PAGEOFF]
				; CHECK-NEXT: bic.8b v0, v1, v0
				; CHECK-NEXT: addv.8b b0, v0
				; CHECK-NEXT: st1.b { v0 }[0], [x0]
				; CHECK-NEXT: ret

				%cmp_result = icmp ne <8 x i8> %vec, zeroinitializer
				store <8 x i1> %cmp_result, ptr %out
				ret void
				}

				define void @store_4_elements_64_bit_vector(<4 x i16> %vec, ptr %out) {
				; CHECK-LABEL: lCPI7_0:
				; CHECK-NEXT: .short 1
				; CHECK-NEXT: .short 2
				; CHECK-NEXT: .short 4
				; CHECK-NEXT: .short 8

				; CHECK-LABEL: store_4_elements_64_bit_vector
				; CHECK: ; %bb.0:
				; CHECK-NEXT: Lloh14:
				; CHECK-NEXT: adrp x8, lCPI7_0@PAGE
				; CHECK-NEXT: cmeq.4h v0, v0, #0
				; CHECK-NEXT: Lloh15:
				; CHECK-NEXT: ldr d1, [x8, lCPI7_0@PAGEOFF]
				; CHECK-NEXT: bic.8b v0, v1, v0
				; CHECK-NEXT: addv.4h h0, v0
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: strb w8, [x0]
				; CHECK-NEXT: ret

				%cmp_result = icmp ne <4 x i16> %vec, zeroinitializer
				store <4 x i1> %cmp_result, ptr %out
				ret void
				}

				define void @store_2_elements_64_bit_vector(<2 x i32> %vec, ptr %out) {
				; CHECK-LABEL: lCPI8_0:
				dmgreen (Not Done): Some of these with low vector lanes are starting to look worse than the code before. The fmov/strb could be done on the FP side, but I don't think that would be enough to make them profitable. Is it worth limiting it to >= 4 vector lanes?
				lawben (Author, Done): On my M1, it is still faster though. But I agree that this needs a bit of investigation on other ARM CPUs. Suggestion: In a follow-up patch, with the changes to `v16i8` suggested by @efriedma, I'll run a set of benchmarks for the `v16i8` and the `v2i64` (and other) cases on my M1, Graviton 2, Graviton 3, and a Pi 4 (and possibly an A64FX, but that has terrible NEON performance across the board). I have this set up for another project at the moment anyway. This should give us a wider range of performance characteristics. So I'd suggest leaving this as is in this patch and then doing a performance-based follow-up patch. Thoughts?
				dmgreen (Not Done): Sure - I was just going from the number of instructions and the extra constant pools. The critical path looks longer (but that might not matter much), and there are fewer FPR->GPR transfers, which will help.
				; CHECK-NEXT: .long 1
				; CHECK-NEXT: .long 2

				; CHECK-LABEL: store_2_elements_64_bit_vector
				; CHECK: ; %bb.0:
				; CHECK-NEXT: Lloh16:
				; CHECK-NEXT: adrp x8, lCPI8_0@PAGE
				; CHECK-NEXT: cmeq.2s v0, v0, #0
				; CHECK-NEXT: Lloh17:
				; CHECK-NEXT: ldr d1, [x8, lCPI8_0@PAGEOFF]
				; CHECK-NEXT: bic.8b v0, v1, v0
				; CHECK-NEXT: addp.2s v0, v0, v0
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: strb w8, [x0]
				; CHECK-NEXT: ret

				%cmp_result = icmp ne <2 x i32> %vec, zeroinitializer
				store <2 x i1> %cmp_result, ptr %out
				ret void
				}

				define void @no_combine_without_truncate(<16 x i8> %vec, ptr %out) {
				; CHECK-LABEL: no_combine_without_truncate
				; CHECK: cmtst.16b v0, v0, v0
				; CHECK-NOT: addv.8b b0, v0

				%cmp_result = icmp ne <16 x i8> %vec, zeroinitializer
				%extended_result = sext <16 x i1> %cmp_result to <16 x i8>
				store <16 x i8> %extended_result, ptr %out
				ret void
				}

				define void @no_combine_for_non_bool_truncate(<4 x i32> %vec, ptr %out) {
				; CHECK-LABEL: no_combine_for_non_bool_truncate
				; CHECK: xtn.4h v0, v0
				; CHECK-NOT: addv.4s s0, v0

				%trunc = trunc <4 x i32> %vec to <4 x i8>
				store <4 x i8> %trunc, ptr %out
				ret void
				}

				define void @no_combine_for_build_vector(i1 %a, i1 %b, i1 %c, i1 %d, ptr %out) {
				; CHECK-LABEL: no_combine_for_build_vector
				; CHECK-NOT: addv

				%1 = insertelement <4 x i1> undef, i1 %a, i64 0
				%2 = insertelement <4 x i1> %1, i1 %b, i64 1
				%3 = insertelement <4 x i1> %2, i1 %c, i64 2
				%vec = insertelement <4 x i1> %3, i1 %d, i64 3
				store <4 x i1> %vec, ptr %out
				ret void
				}
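
The `store_N_elements_*` tests above all check the same shape of code: a lane-wise compare produces an all-ones or all-zeros mask per lane, the mask selects a per-lane power of two loaded from the constant pool (`1, 2, 4, ...`), and a horizontal add (`addv`/`addp`) folds the lanes into one scalar whose low byte is stored. The following is a minimal Python sketch of that arithmetic only. It is not the patch's implementation, the function names are hypothetical, and it assumes at most 8 lanes, as in the tests shown here.

```python
def compare_ne_zero(lanes, lane_bits):
    """Model `icmp ne %vec, zeroinitializer`: all-ones lane where the input is non-zero."""
    all_ones = (1 << lane_bits) - 1
    return [all_ones if lane != 0 else 0 for lane in lanes]


def pack_mask_to_bits(mask_lanes):
    """Model the mask-select and horizontal-add steps for up to 8 lanes."""
    # Lane i is assigned the constant 2**i, matching the constant-pool
    # entries (.byte/.short/.long 1, 2, 4, ...) in the CHECK lines.
    powers = [1 << i for i in range(len(mask_lanes))]
    selected = [mask & power for mask, power in zip(mask_lanes, powers)]
    # Every lane contributes a distinct bit, so the sum behaves like an OR.
    return sum(selected) & 0xFF


def store_bool_vector(lanes, lane_bits):
    """Return the byte written for `store <N x i1> (lanes != 0)` in this model."""
    return pack_mask_to_bits(compare_ne_zero(lanes, lane_bits))
```

For instance, `store_bool_vector([1, 0, 3, 0, 0, 0, 0, 9], 8)` sets bits 0, 2 and 7 and yields `0x85`. The generated code in the tests uses `cmeq` followed by `bic` (constant AND NOT equal-mask), which selects the same lanes as the not-equal mask modelled here.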

llvm/test/CodeGen/AArch64/vec_uaddo.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
		dmgreen (Not Done): Remove this?
; RUN: llc < %s -mtriple=aarch64-none-linux-gnu -mattr=+neon | FileCheck %s --check-prefix=CHECK		; RUN: llc < %s -mtriple=aarch64-none-linux-gnu -mattr=+neon | FileCheck %s --check-prefix=CHECK

declare {<1 x i32>, <1 x i1>} @llvm.uadd.with.overflow.v1i32(<1 x i32>, <1 x i32>)		declare {<1 x i32>, <1 x i1>} @llvm.uadd.with.overflow.v1i32(<1 x i32>, <1 x i32>)
declare {<2 x i32>, <2 x i1>} @llvm.uadd.with.overflow.v2i32(<2 x i32>, <2 x i32>)		declare {<2 x i32>, <2 x i1>} @llvm.uadd.with.overflow.v2i32(<2 x i32>, <2 x i32>)
declare {<3 x i32>, <3 x i1>} @llvm.uadd.with.overflow.v3i32(<3 x i32>, <3 x i32>)		declare {<3 x i32>, <3 x i1>} @llvm.uadd.with.overflow.v3i32(<3 x i32>, <3 x i32>)
declare {<4 x i32>, <4 x i1>} @llvm.uadd.with.overflow.v4i32(<4 x i32>, <4 x i32>)		declare {<4 x i32>, <4 x i1>} @llvm.uadd.with.overflow.v4i32(<4 x i32>, <4 x i32>)
declare {<6 x i32>, <6 x i1>} @llvm.uadd.with.overflow.v6i32(<6 x i32>, <6 x i32>)		declare {<6 x i32>, <6 x i1>} @llvm.uadd.with.overflow.v6i32(<6 x i32>, <6 x i32>)
declare {<8 x i32>, <8 x i1>} @llvm.uadd.with.overflow.v8i32(<8 x i32>, <8 x i32>)		declare {<8 x i32>, <8 x i1>} @llvm.uadd.with.overflow.v8i32(<8 x i32>, <8 x i32>)
[... 231 lines not shown ...]	; CHECK-NEXT: ret
store <4 x i24> %val, ptr %p2		store <4 x i24> %val, ptr %p2
ret <4 x i32> %res		ret <4 x i32> %res
}		}

define <4 x i32> @uaddo_v4i1(<4 x i1> %a0, <4 x i1> %a1, ptr %p2) nounwind {		define <4 x i32> @uaddo_v4i1(<4 x i1> %a0, <4 x i1> %a1, ptr %p2) nounwind {
; CHECK-LABEL: uaddo_v4i1:		; CHECK-LABEL: uaddo_v4i1:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: movi v2.4h, #1		; CHECK-NEXT: movi v2.4h, #1
		; CHECK-NEXT: adrp x8, .LCPI10_0
		; CHECK-NEXT: ldr d3, [x8, :lo12:.LCPI10_0]
; CHECK-NEXT: and v1.8b, v1.8b, v2.8b		; CHECK-NEXT: and v1.8b, v1.8b, v2.8b
; CHECK-NEXT: and v0.8b, v0.8b, v2.8b		; CHECK-NEXT: and v0.8b, v0.8b, v2.8b
; CHECK-NEXT: add v0.4h, v0.4h, v1.4h		; CHECK-NEXT: add v0.4h, v0.4h, v1.4h
; CHECK-NEXT: umov w8, v0.h[0]		; CHECK-NEXT: shl v1.4h, v0.4h, #15
; CHECK-NEXT: umov w9, v0.h[1]		; CHECK-NEXT: and v2.8b, v0.8b, v2.8b
; CHECK-NEXT: umov w10, v0.h[2]		; CHECK-NEXT: cmeq v0.4h, v2.4h, v0.4h
; CHECK-NEXT: umov w11, v0.h[3]		; CHECK-NEXT: cmlt v1.4h, v1.4h, #0
; CHECK-NEXT: and v1.8b, v0.8b, v2.8b
; CHECK-NEXT: cmeq v0.4h, v1.4h, v0.4h
; CHECK-NEXT: and w8, w8, #0x1
; CHECK-NEXT: bfi w8, w9, #1, #1
; CHECK-NEXT: mvn v0.8b, v0.8b		; CHECK-NEXT: mvn v0.8b, v0.8b
; CHECK-NEXT: bfi w8, w10, #2, #1		; CHECK-NEXT: and v1.8b, v1.8b, v3.8b
; CHECK-NEXT: orr w8, w8, w11, lsl #3		; CHECK-NEXT: addv h1, v1.4h
; CHECK-NEXT: and w8, w8, #0xf
; CHECK-NEXT: sshll v0.4s, v0.4h, #0		; CHECK-NEXT: sshll v0.4s, v0.4h, #0
		; CHECK-NEXT: fmov w8, s1
; CHECK-NEXT: strb w8, [x0]		; CHECK-NEXT: strb w8, [x0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%t = call {<4 x i1>, <4 x i1>} @llvm.uadd.with.overflow.v4i1(<4 x i1> %a0, <4 x i1> %a1)		%t = call {<4 x i1>, <4 x i1>} @llvm.uadd.with.overflow.v4i1(<4 x i1> %a0, <4 x i1> %a1)
%val = extractvalue {<4 x i1>, <4 x i1>} %t, 0		%val = extractvalue {<4 x i1>, <4 x i1>} %t, 0
%obit = extractvalue {<4 x i1>, <4 x i1>} %t, 1		%obit = extractvalue {<4 x i1>, <4 x i1>} %t, 1
%res = sext <4 x i1> %obit to <4 x i32>		%res = sext <4 x i1> %obit to <4 x i32>
store <4 x i1> %val, ptr %p2		store <4 x i1> %val, ptr %p2
ret <4 x i32> %res		ret <4 x i32> %res
[... 26 lines not shown ...]
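
To make the two columns of `uaddo_v4i1` easier to compare, here is a hedged Python model of how each version packs the four `i1` lanes into the stored byte. The left column moves every lane to a general-purpose register (`umov`, `bfi`, `orr`), while the right column stays in the vector unit (`shl`, `cmlt`, `and`, `addv`) and transfers once (`fmov`). The contents of `.LCPI10_0` are not visible in this diff, so the sketch assumes they are `1, 2, 4, 8`, as in `lCPI7_0` of the new test file. The function names are hypothetical.

```python
def old_scalar_pack(lanes):
    """Left column: extract each 16-bit lane and insert its low bit into w8."""
    w8 = lanes[0] & 0x1                          # and w8, w8, #0x1
    w8 = (w8 & ~0x2) | ((lanes[1] & 0x1) << 1)   # bfi w8, w9, #1, #1
    w8 = (w8 & ~0x4) | ((lanes[2] & 0x1) << 2)   # bfi w8, w10, #2, #1
    w8 |= (lanes[3] << 3) & 0xFFFFFFFF           # orr w8, w8, w11, lsl #3
    return w8 & 0xF                              # and w8, w8, #0xf


def new_vector_pack(lanes):
    """Right column: broadcast bit 0 of each lane, select 2**i, sum the lanes."""
    shifted = [(lane << 15) & 0xFFFF for lane in lanes]              # shl #15
    masks = [0xFFFF if value & 0x8000 else 0 for value in shifted]   # cmlt #0
    # Assumed constant pool: lane i holds 2**i.
    selected = [mask & (1 << i) for i, mask in enumerate(masks)]     # and
    return sum(selected) & 0xFF                                      # addv, fmov, strb
```

Under that assumption both functions return the same byte for any lane contents, which fits the reviewers' point that the change is about instruction count and register transfers rather than the stored value.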

llvm/test/CodeGen/AArch64/vec_umulo.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
		dmgreen (Not Done): Remove this?
; RUN: llc < %s -mtriple=aarch64-none-linux-gnu -mattr=+neon | FileCheck %s --check-prefix=CHECK		; RUN: llc < %s -mtriple=aarch64-none-linux-gnu -mattr=+neon | FileCheck %s --check-prefix=CHECK

declare {<1 x i32>, <1 x i1>} @llvm.umul.with.overflow.v1i32(<1 x i32>, <1 x i32>)		declare {<1 x i32>, <1 x i1>} @llvm.umul.with.overflow.v1i32(<1 x i32>, <1 x i32>)
declare {<2 x i32>, <2 x i1>} @llvm.umul.with.overflow.v2i32(<2 x i32>, <2 x i32>)		declare {<2 x i32>, <2 x i1>} @llvm.umul.with.overflow.v2i32(<2 x i32>, <2 x i32>)
declare {<3 x i32>, <3 x i1>} @llvm.umul.with.overflow.v3i32(<3 x i32>, <3 x i32>)		declare {<3 x i32>, <3 x i1>} @llvm.umul.with.overflow.v3i32(<3 x i32>, <3 x i32>)
declare {<4 x i32>, <4 x i1>} @llvm.umul.with.overflow.v4i32(<4 x i32>, <4 x i32>)		declare {<4 x i32>, <4 x i1>} @llvm.umul.with.overflow.v4i32(<4 x i32>, <4 x i32>)
declare {<6 x i32>, <6 x i1>} @llvm.umul.with.overflow.v6i32(<6 x i32>, <6 x i32>)		declare {<6 x i32>, <6 x i1>} @llvm.umul.with.overflow.v6i32(<6 x i32>, <6 x i32>)
declare {<8 x i32>, <8 x i1>} @llvm.umul.with.overflow.v8i32(<8 x i32>, <8 x i32>)		declare {<8 x i32>, <8 x i1>} @llvm.umul.with.overflow.v8i32(<8 x i32>, <8 x i32>)
[... 281 lines not shown ...]	; CHECK-NEXT: ret
%res = sext <4 x i1> %obit to <4 x i32>		%res = sext <4 x i1> %obit to <4 x i32>
store <4 x i24> %val, ptr %p2		store <4 x i24> %val, ptr %p2
ret <4 x i32> %res		ret <4 x i32> %res
}		}

define <4 x i32> @umulo_v4i1(<4 x i1> %a0, <4 x i1> %a1, ptr %p2) nounwind {		define <4 x i32> @umulo_v4i1(<4 x i1> %a0, <4 x i1> %a1, ptr %p2) nounwind {
; CHECK-LABEL: umulo_v4i1:		; CHECK-LABEL: umulo_v4i1:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: fmov d2, d0		; CHECK-NEXT: adrp x8, .LCPI10_0
		; CHECK-NEXT: and v0.8b, v0.8b, v1.8b
		; CHECK-NEXT: shl v0.4h, v0.4h, #15
		; CHECK-NEXT: ldr d1, [x8, :lo12:.LCPI10_0]
		; CHECK-NEXT: cmlt v0.4h, v0.4h, #0
		; CHECK-NEXT: and v0.8b, v0.8b, v1.8b
		; CHECK-NEXT: addv h1, v0.4h
; CHECK-NEXT: movi v0.2d, #0000000000000000		; CHECK-NEXT: movi v0.2d, #0000000000000000
; CHECK-NEXT: and v1.8b, v2.8b, v1.8b		; CHECK-NEXT: fmov w8, s1
; CHECK-NEXT: umov w8, v1.h[0]
; CHECK-NEXT: umov w9, v1.h[1]
; CHECK-NEXT: umov w10, v1.h[2]
; CHECK-NEXT: umov w11, v1.h[3]
; CHECK-NEXT: and w8, w8, #0x1
; CHECK-NEXT: bfi w8, w9, #1, #1
; CHECK-NEXT: bfi w8, w10, #2, #1
; CHECK-NEXT: orr w8, w8, w11, lsl #3
; CHECK-NEXT: and w8, w8, #0xf
; CHECK-NEXT: strb w8, [x0]		; CHECK-NEXT: strb w8, [x0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%t = call {<4 x i1>, <4 x i1>} @llvm.umul.with.overflow.v4i1(<4 x i1> %a0, <4 x i1> %a1)		%t = call {<4 x i1>, <4 x i1>} @llvm.umul.with.overflow.v4i1(<4 x i1> %a0, <4 x i1> %a1)
%val = extractvalue {<4 x i1>, <4 x i1>} %t, 0		%val = extractvalue {<4 x i1>, <4 x i1>} %t, 0
%obit = extractvalue {<4 x i1>, <4 x i1>} %t, 1		%obit = extractvalue {<4 x i1>, <4 x i1>} %t, 1
%res = sext <4 x i1> %obit to <4 x i32>		%res = sext <4 x i1> %obit to <4 x i32>
store <4 x i1> %val, ptr %p2		store <4 x i1> %val, ptr %p2
ret <4 x i32> %res		ret <4 x i32> %res
[... 46 lines not shown ...]
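
The `umulo_v4i1` output may look surprising at first: the product is a plain `and` and the overflow result is a constant zero (`movi v0.2d, #0000000000000000`). A small Python check of 1-bit unsigned arithmetic shows why this is consistent. This is an illustrative sketch with hypothetical helper names, not code from the patch.

```python
def uadd_with_overflow_i1(a, b):
    """1-bit unsigned add: returns (wrapped sum, overflow flag)."""
    total = a + b
    return total & 1, total > 1


def umul_with_overflow_i1(a, b):
    """1-bit unsigned multiply: returns (wrapped product, overflow flag)."""
    product = a * b
    return product & 1, product > 1


def enumerate_i1_cases():
    """List (a, b, add result, mul result) for all four input combinations."""
    return [
        (a, b, uadd_with_overflow_i1(a, b), umul_with_overflow_i1(a, b))
        for a in (0, 1)
        for b in (0, 1)
    ]
```

For 1-bit operands the product always equals `a & b` and never overflows, whereas the add wraps to `a ^ b` and overflows exactly when both inputs are 1. That matches the need for a real overflow computation in `uaddo_v4i1` but not in `umulo_v4i1`.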