This is an archive of the discontinued LLVM Phabricator instance.

Differential D138203

[AArch64][InstCombine] Simplify repeated complex patterns in dupqlane
Closed · Public

Authored by MattDevereau on Nov 17 2022, 5:35 AM.

Details

Reviewers

paulwalker-arm
peterwaller-arm
c-rhodes
benmxwl-arm
dtemirbulatov
sdesmalen

Summary

[AArch64][InstCombine] Simplify repeated complex patterns in dupqlane

Repeated floating-point complex patterns in dupqlane such as (f32 a, f32 b, f32 a, f32 b) can be simplified to shufflevector(f64(a, b), undef, 0)

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	60,230 ms	x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded::vloxseg.c
	60,160 ms	x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded::vluxseg.c
	60,210 ms	x64 debian > Clang.Driver::arm-cortex-cpus-1.c
	60,220 ms	x64 debian > Clang.Driver::arm-cortex-cpus-2.c
	60,070 ms	x64 debian > Clang.Driver::crash-report.cpp

Event Timeline

MattDevereau created this revision. Nov 17 2022, 5:35 AM

Herald added a project: Restricted Project. Nov 17 2022, 5:35 AM

Herald added subscribers: nlopes, hiraditya, kristof.beyls.

MattDevereau requested review of this revision. Nov 17 2022, 5:35 AM

Herald added a project: Restricted Project. Nov 17 2022, 5:35 AM

Herald added a subscriber: llvm-commits.

This is essentially an InstCombine version of the work done in https://reviews.llvm.org/D133116.

MattDevereau mentioned this in D133116: [AArch64][SVE] Optimise repeated floating-point complex patterns to splat. Nov 17 2022, 5:37 AM

nlopes added inline comments. Nov 17 2022, 5:42 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1435	Please use PoisonValue here and whenever possible. We are trying to remove undef from LLVM. Actually, here you can even use a different API: `CreateInsertElement(Type *VecTy, Value *NewElt, Value *Idx)`. Thank you!

Harbormaster completed remote builds in B198191: Diff 476099. Nov 17 2022, 6:26 AM

Matt added a subscriber: Matt. Nov 17 2022, 8:45 AM

LGTM, but please address the comment about poison and allow time for other reviewers to chime in.

This revision is now accepted and ready to land. Dec 13 2022, 1:26 AM

peterwaller-arm added a reviewer: sdesmalen. Dec 13 2022, 1:26 AM

tschuett added a subscriber: tschuett. Dec 13 2022, 1:29 AM

tschuett added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1384	Please use `std::optional`. The LLVM variant is going away.
1428	Please use `std::nullopt`.

https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
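
As a small illustration of the `std::optional` / `std::nullopt` style requested in the two comments above, here is a minimal standalone sketch. The function `exactLog2` is a hypothetical example and does not come from the patch; it only shows the "return a value or `std::nullopt`" shape that the InstCombine helpers in this file follow.

```cpp
#include <optional>

// Hypothetical helper (not part of the patch): returns log2(X) when X is a
// power of two, and std::nullopt otherwise.
std::optional<unsigned> exactLog2(unsigned X) {
  // Zero and values with more than one bit set are not powers of two.
  if (X == 0 || (X & (X - 1)) != 0)
    return std::nullopt;
  unsigned Log = 0;
  while (X >>= 1)
    ++Log;
  return Log;
}
```

A caller checks the result with `has_value()` (or a plain `if (auto R = exactLog2(N))`) before dereferencing it.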

Could you maybe update the commit message, it seems to have the title twice :)

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1390	nit: VecTy->getScalarType() can be propagated into the few uses below.
1400–1402	nit: This can be moved to the start of the function, so that it bails out immediately if the conditions are not as expected.
1401	Should it also bail out if the vector being inserted is not a FixedLengthVector?
1408	nit: LLVM's coding style uses capitalised variable names, so `I` in this case.
1408	Does this algorithm work if the vector being inserted is not a power-of-two vector? (I guess when it is inserting into an 'undef' vector, you could still consider the other values to be anything you like and the algorithm may still work). Could you also add a test-case for this?
1425–1428	This is a little bit restrictive, because it could also work for e.g. <16 x i8> <a, b, c, d, a, b, c, d, a, b, c, d, a, b, c, d> where only `<a, b, c, d>` would need to be splat. It might help instead to find the 'minimum' set by recursively halving the vector and seeing if all elements match. e.g. <a, b, a, b, a, b, a, b> => <a, b, a, b> == <a, b, a, b> => <a, b> == <a, b> so that the minimum set to splat is `<a, b>`
llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
87	This case could still work right? If the two elements that are missing are both undef, they could be anything including `a, b`.
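
The recursive-halving idea suggested at 1425–1428 can be sketched outside LLVM as follows. This is only an illustration under stated assumptions: lanes are modelled as strings rather than `Value *`, and the name `shrinkToMinimalPattern` is hypothetical. It does not handle undefined lanes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: repeatedly halve Pattern while its two halves are
// identical, leaving the smallest repeating unit in place.
// Returns true if at least one halving succeeded.
bool shrinkToMinimalPattern(std::vector<std::string> &Pattern) {
  std::size_t Size = Pattern.size();
  // Only power-of-two sizes of at least 2 can be split into equal halves.
  if (Size < 2 || (Size & (Size - 1)) != 0)
    return false;
  std::size_t Half = Size / 2;
  for (std::size_t I = 0; I < Half; ++I)
    if (Pattern[I] != Pattern[I + Half])
      return false; // halves differ, so Pattern is already minimal
  Pattern.resize(Half);
  shrinkToMinimalPattern(Pattern); // try to halve again
  return true;
}
```

With this sketch, `<a, b, a, b, a, b, a, b>` shrinks to `<a, b>`, while `<a, b, c, d>` is left untouched and the function returns false.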

MattDevereau updated this revision to Diff 486000. Jan 3 2023, 8:08 AM

MattDevereau added inline comments. Jan 3 2023, 8:19 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1425–1428	I've implemented a recursive function which now handles `<a, b, c, d, a, b, c, d>`
llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
87	The case can definitely work; however, when re-implementing this patch as a recursive algorithm I ran into a few headaches when trying to integrate null pointers/poison values into the recursion. It is entirely possible to do this, but if there is minimal gain for the time required, I'd suggest it be done as a separate patch. Some cases I ran into issues with were how to handle cases such as `<a, b, c, nullptr, a, b, c, d>`, where nullptr represents poison elements. Logically you'd want to pick the right half as a pattern as it has no undefined values, but things start getting complicated with cases such as `<a, b, nullptr, nullptr, nullptr, nullptr, nullptr, d>` and `<a, nullptr, a, nullptr, nullptr, b, nullptr, b>`. It should be possible to simplify these; however, I suspect it would be easier to write a separate algorithm from what I've done here to handle poison cases.

Harbormaster completed remote builds in B205471: Diff 486000. Jan 3 2023, 9:12 AM

sdesmalen added inline comments. Jan 4 2023, 3:21 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1437	It's better to take and return an ArrayRef<Value*>, rather than a SmallVector object.
1438	nit: you can drop the `std::` here.
1443	if the code is not going to modify the content of the array, it's better to use ArrayRef to avoid creating a new object and copying (existing) data.
1476–1477	When the loop `break`s here, it will leave the other elements in InsertEltVec as `nullptr` from which you seem to infer `poison` or `undef` in SimplifyValuePattern. That's not always correct, because the IR could be inserting into some other vector, e.g. `<4 x i32> %arg` where these values are defined (it would be good to have a test for this)
llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
87	I wouldn't expect the undef/poison case to be that difficult to handle. E.g. for `<a, b, _, _, _, _, _, d>` it could compare `<a, b, _, _>` with `<_, _, _, d>`, which would match the most specific values `<a, b, _, d>`. For `<a, _, a, _, _, b, _, b>` it could compare `<a, _, a, _>` with `<_, b, _, b>` which would match `<a, b, a, b>`, which could then be further broken down to `<a, b>`.
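
The merging of halves described in the comment above can be sketched as a standalone function. This is an illustration only, not the patch's code: lanes are modelled as `std::optional<char>` with `std::nullopt` standing in for an undefined lane (`_`), and the name `foldPattern` is hypothetical.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// std::nullopt models an undefined ('_') lane.
using Lane = std::optional<char>;

// Hypothetical sketch: fold the upper half of Lanes onto the lower half,
// letting an undefined lane adopt the value from the other half. Stops
// (returning false) when two defined lanes disagree.
bool foldPattern(std::vector<Lane> &Lanes) {
  std::size_t Size = Lanes.size();
  if (Size < 2 || (Size & (Size - 1)) != 0)
    return false;
  std::size_t Half = Size / 2;
  std::vector<Lane> Merged(Half);
  for (std::size_t I = 0; I < Half; ++I) {
    const Lane &LHS = Lanes[I];
    const Lane &RHS = Lanes[I + Half];
    if (LHS && RHS && *LHS != *RHS)
      return false; // both lanes defined but different: no repetition
    Merged[I] = LHS ? LHS : RHS; // keep the most specific value
  }
  Lanes = Merged;
  foldPattern(Lanes); // try to fold the merged result again
  return true;
}
```

On the two examples from the comment, `<a, b, _, _, _, _, _, d>` folds to `<a, b, _, d>` and stops there, while `<a, _, a, _, _, b, _, b>` folds to `<a, b, a, b>` and then to `<a, b>`.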

MattDevereau updated this revision to Diff 486308. Jan 4 2023, 8:46 AM

MattDevereau marked 2 inline comments as done. Jan 4 2023, 8:58 AM

MattDevereau added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1437	With the addition of allowing inserts into poison vectors to match patterns, the `SmallVector` now has to be mutable in order to replace poison values.
1443	With the addition of allowing inserts into poison vectors to match patterns, the `SmallVector` now has to be mutable in order to replace poison values.
1476–1477	This should now be ok with the inserting into poison logic added. I added the test `dupq_f16_ab_pattern_no_end_indices_not_poison` which I believe should cover this.
llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
87	I had an "aha" moment when reading this comment. In the previous diff revision I was doing the recursive algorithm from the bottom up, i.e. splitting the vectors into two, passing them down the recursion chain, doing the comparison logic, and then passing the result back up the recursion chain for further comparison. That was causing a lot of strife with more complicated patterns and poison values. It is much simpler to approach it from a top-down angle, where the nullptr/poison/_ values get replaced immediately and the recursive call is in the return statement. The tests `dupq_f16_ab_pattern_no_front_indices` and `dupq_f16_ab_pattern_no_end_indices` should show this in action.

Harbormaster completed remote builds in B205717: Diff 486308. Jan 4 2023, 10:20 AM

sdesmalen added inline comments. Jan 5 2023, 8:23 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1437	It's better to not pass/return large objects by value. If you can't use ArrayRef because you have to modify the values, perhaps you can pass it by reference, and make it both the input and output array (for example by truncating the result)
1438	You marked this as Done, but it seems unchanged.
1483	What happens if there is an insert into the same lane? e.g.
	%1 = insertelement <8 x half> %0, half %x, i64 1
	%2 = insertelement <8 x half> %1, half %y, i64 1
1500–1501	Can this ever happen? I would expect the values in Pattern to always be defined at this point.
1503	At this point we know the Pattern.size() > 1, so can you hoist this out of the loop, and start the loop counter `I` at 1 instead of 0?
1522	nit: you can use `auto` here, because it's obvious from the RHS what the type will be

MattDevereau marked an inline comment as done. Jan 5 2023, 9:22 AM

MattDevereau added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1437	I'm questioning whether this is adding gold plating. The SmallVectors hold pointers to values, so it's just the SmallVector objects that are being copied, and this function will recurse at most 15 times. This will also need rewiring to use bool returns, as the input and output vectors from this function are compared for equality to check if any pattern was found. It will be beneficial though, so I'll have a go at implementing it.
1483	The latest insert is used (%y in this case) and the transform works ok, I'll add a test for this.
1500–1501	It can happen in cases such as <a, b, c, _, a, b, c, _> which I will add a test for
1503	`Pattern.size() > 1` will be true, but Pattern[0] can be null when the pattern `<_, b, c, d>` is evaluated. The previous revision hoisted this, but some tests were breaking as it was trying to create an InsertElement with a null 0th parameter. I think it makes sense to check it in the loop instead of hoisting it, to make sure you don't do this.

MattDevereau updated this revision to Diff 487288. Jan 8 2023, 10:17 PM

MattDevereau marked an inline comment as done.

MattDevereau marked 2 inline comments as done. Jan 8 2023, 10:22 PM

MattDevereau added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1437	I've re-written this to now use a single vector passed by reference.
1438	I've fully replaced references to std::size_t with size_t now
1483	I've added the test `dupq_f16_abcd_pattern_double_insert` to `sve-intrinsic-dupqlane.ll` which asserts this behaviour.
1500–1501	I've added the test `dupq_f16_abcnull_pattern` to `sve-intrinsic-dupqlane.ll` to assert this.

Harbormaster completed remote builds in B206432: Diff 487288. Jan 8 2023, 11:06 PM

sdesmalen added inline comments. Jan 9 2023, 4:24 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1437	Thanks, that looks much better!
1468–1469	Why is this optimisation limited to integer types?
1471–1472	It might be easier to write this as:
	Value *CurrentInsertElt = nullptr, *Default = nullptr;
	if (!match(II, m_Intrinsic<Intrinsic::vector_insert>(m_Value(Default), m_Value(CurrentInsertElt), m_Value())) ||
	    !isa<FixedVectorType>(CurrentInsertElt->getType()))
	as that matches the operands in one go.
1481	Does this need to be a decrementing loop? Or could you write:
	while (InsertElementInst *InsertElt = dyn_cast<InsertElementInst>(CurrentInsertElt)) {
	  auto Idx = cast<ConstantInt>(InsertElt->getOperand(2));
	  InsertEltVec[Idx->getValue().getZExtValue()] = InsertElt->getOperand(1);
	  CurrentInsertElt = InsertElt->getOperand(0);
	}
1491	You'll also need to check the lanes of the original vector being inserted into:
	%1 = insertelement <4 x float> poison, float %x, i64 0 ; <--- this poison value must be checked instead of the llvm.vector.insert one.
	%2 = insertelement <4 x float> %1, float %y, i64 1
	%3 = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v4f32(<vscale x 4 x float> poison, <4 x float> %2, i64 0)
	%4 = tail call <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float> %3, i64 0)
	However, you'll also need to check the default value of the llvm.vector.insert for this case:
	%1 = insertelement <2 x float> poison, float %x, i64 0
	%2 = insertelement <2 x float> %1, float %y, i64 1
	%3 = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v2f32(<vscale x 4 x float> poison, <2 x float> %2, i64 0) ; the full subvector is defined, but not all lanes in the subvector that's being dupq'ed.
	%4 = tail call <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float> %3, i64 0)

MattDevereau updated this revision to Diff 487431. Jan 9 2023, 7:30 AM

MattDevereau marked 3 inline comments as done.

MattDevereau added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1468–1469	Oops, I ran into problems with integer types on the first pass I did on this patch, but I can see the codegen is quite nice now with a single fmov:
	fmov s0, w0
	mov v0.h[1], w1
	mov v0.h[2], w2
	mov v0.h[3], w3
	mov z0.d, d0
	ret
	for example with `<a, b, c, d>` as i16s. I'll remove this.
1471–1472	That looks a bit nicer, done.
1481	It doesn't need to be a decrementing loop. This works well too, so I've put it in.
1491	We discussed outside the review that this would be ideally done in a follow-up patch.

Removed the logic to insert incomplete vectors into poison vectors.

Added a condition `if (!isa<PoisonValue>(CurrentInsertElt) || !isa<PoisonValue>(Default) ||` to check that all lanes of the subvector being inserted are defined and that the first inserted element is inserted into poison.

Harbormaster completed remote builds in B206555: Diff 487458. Jan 9 2023, 10:05 AM

I've left a bunch more comments, but overall this patch is moving in the right direction @MattDevereau!

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1443–1448	nit: please capitalise Lhs -> LHS and Rhs -> RHS (that's more common in LLVM for these variable names). And feel free to ignore this suggestion, but I personally find this a little easier to read:
	ArrayRef<Value*> Ref(Vec);
	auto LHS = Ref.take_front(HalfVecSize);
	auto RHS = Ref.drop_front(HalfVecSize);
	for (unsigned I=0; I<HalfVecSize; ++I) {
	  if (LHS[I] != nullptr && RHS[I] != nullptr && LHS[I] == RHS[I])
	    continue;
	(i.e. using ArrayRef and indexing, rather than two changing pointers)
1451–1452	Can this condition be moved to the start of this function?
1470	nit: is it worth renaming this to `Elts` ? That's a bit simpler.
1471	nit: this can be `auto`, because the type is clear from the `dyn_cast<InsertElementInst>`.
1478	I don't think you'll need to check this, since you've already got the check for `Lhs != nullptr && Rhs != nullptr` in SimplifyValuePattern, which should always return `false` if any of the lanes is not explicitly defined. And because the size of `InsertEltVec` is based on the minimum number of elements of the scalable vector, it will also return false for the following case:
	%0 = insertelement <2 x half> poison, half %a, i64 0
	%1 = insertelement <2 x half> %0, half %b, i64 0
	%2 = call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v2f16(<vscale x 8 x half> poison, <2 x half> %1, i64 0)
	Because InsertEltVec will be defined as: [%a, %b, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr]
1483	I still think that the code is not correct yet, because for e.g.
	...
	%7 = insertelement <8 x half> %6, half %c, i64 6
	%8 = insertelement <8 x half> %7, half %c, i64 7
	%9 = insertelement <8 x half> %8, half %d, i64 7
	%10 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> %9, i64 0)
	It starts at %9, so will first store %d to InsertEltVec[7]. Then it looks at %8, and it will store %c to InsertEltVec[7], which means the order is reversed. The reason that your test still passes is that InstCombine has already removed the extra `insertelement` when it comes to this function. Perhaps you can just return `false` if the element already exists in InsertEltVec?
1486	If you define `InsertEltChain` as `PoisonValue::get(CurrentInsertElt->getType())`, then you can start the loop at I = 0.
1489	I think you can do just `Builder.getInt64(I)` (and then you don't need the extra variable for it, and can propagate it directly into the expression below, and also drop the curly braces :)
1498	unsigned?
1499	unsigned?
1508	can just as well use `auto` or `Value *` for all these variables here, since their contents don't really matter too much.
llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
31	It's a pity this case isn't optimised. This is probably because instcombine has transformed it into a 'shufflevector' splat. You could do something like this:
	if (Value *V = getSplatValue(CurrentInsertElt)) {
	  InsertEltVec[0] = V;
	  InsertEltVec.resize(1);
	} else {
	  while (InsertElementInst *InsertElt = dyn_cast<InsertElementInst>(CurrentInsertElt)) {
	    ...
	  }
	to handle it.
95	Is this test much different from `@dupq_f16_ab_pattern_no_end_indices`?
137–138	If you swap `%c` and `%d` this at least becomes a negative test to ensure LLVM doesn't perform the wrong optimisation. That's better than it being a positive test where it just happens to do the right thing because another InstCombine rule has fired first, even though your code would have optimised this incorrectly.
238	no end indices not poison? (perhaps there is a double negative here that confuses me?) In any case, the end indices are poison, because the test is inserting into `poison`
261	Given the algorithm just compares two halves of the vector, I'm not sure there's much value in having both negative tests for "no_front_pattern", "no_middle_pattern" and "no_end_pattern"
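
The lane-ordering concern raised at 1483 above can be illustrated with a small standalone model of an `insertelement` chain. Everything here is an assumption made for illustration: `InsertNode` and `collectLanes` are hypothetical names, scalars are modelled as `char`, and `'\0'` marks a lane that has not been written. The sketch walks from the last insert back towards the base vector and gives up on a repeated lane, which is the "return `false` if the element already exists" behaviour suggested in the comment.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical model of one insertelement: writes Scalar into lane Index of
// the vector produced by Prev (nullptr models a poison base vector).
struct InsertNode {
  const InsertNode *Prev;
  char Scalar;
  std::size_t Index;
};

// Walk the chain from the last insert towards the base. The first write seen
// for a lane is the latest one in program order, so an older write to the
// same lane must not overwrite it; this sketch simply bails out instead.
std::optional<std::vector<char>> collectLanes(const InsertNode *Last,
                                              std::size_t NumLanes) {
  std::vector<char> Lanes(NumLanes, '\0'); // '\0' = lane not written yet
  for (const InsertNode *N = Last; N; N = N->Prev) {
    if (N->Index >= NumLanes || Lanes[N->Index] != '\0')
      return std::nullopt; // out of range, or lane written twice
    Lanes[N->Index] = N->Scalar;
  }
  return Lanes;
}
```

A chain that writes lane 1 twice is rejected outright, so the result no longer depends on another InstCombine rule having removed the redundant insert first.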

MattDevereau added inline comments. Jan 10 2023, 7:51 AM

llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
31	I think the assembly output ends up as just a single MOV instruction and there isn't much to gain here (I'll double check this), the thinking behind this test was more just to prove the recursion can indeed get the smallest possible pattern `a`.
95	This test would splat `<a, b, c, _>` as a 64bit value whereas `@dupq_f16_ab_pattern_no_end_indices` would splat `<a, b>` as a 32bit value, which could check the algorithm can distinguish between the two, however I suppose that difference is ultimately not so important so I'll remove the indices tests.
238	Some of these tests that appear redundant are left over from previous revisions, where allowing inserts of poison values changed things. The first negative is that indices 6 & 7 are missing from the `insertelement` chain. The second negative is because the first parameter of the vector insert is a non-poison value `c`, unlike the other tests, which all use poison.
261	These tests don't mean much without the poison value insertions now, so I'll remove them.
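
The 64-bit versus 32-bit distinction mentioned at 95 above comes down to simple arithmetic on the pattern length. The sketch below is a hypothetical helper pair, not code from the patch; it assumes the 128-bit quadword that `dupq` duplicates and a non-empty pattern.

```cpp
// Hypothetical helper: width in bits of the single widened element that
// would be splatted for a repeating pattern of PatternLanes scalars, each
// ScalarBits wide.
constexpr unsigned widenedElementBits(unsigned ScalarBits,
                                      unsigned PatternLanes) {
  return ScalarBits * PatternLanes;
}

// Hypothetical helper: number of widened elements in one 128-bit quadword.
// Assumes PatternLanes is non-zero.
constexpr unsigned widenedElementsPerQuad(unsigned ScalarBits,
                                          unsigned PatternLanes) {
  return 128 / widenedElementBits(ScalarBits, PatternLanes);
}
```

For f16 lanes, a four-lane pattern gives a 64-bit element and a two-lane pattern gives a 32-bit element; for the `(f32 a, f32 b, f32 a, f32 b)` case from the summary, the two-lane pattern gives the 64-bit element described there.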

MattDevereau updated this revision to Diff 488148. Jan 11 2023, 4:01 AM

MattDevereau marked 15 inline comments as done.

Harbormaster completed remote builds in B207040: Diff 488148. Jan 11 2023, 5:16 AM

MattDevereau added inline comments. Jan 11 2023, 12:00 PM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1443–1448	I think there's pros and cons to this approach, in your example the setup looks nicer but I think the ability to resize the vector without passing indices through function parameters looks cleaner and makes the result easier to process. I will stick with the current implementation for now I think just to keep the review more compact.
1451–1452	Yes it can be, done
1478	Fair enough, removed.
1489	This loop no longer looks quite as hideous, thanks :)
llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
31	This case currently gets emitted as a neon mov:
	dup v0.8h, v0.h[0]
	mov z0.q, q0
	ret
	Implementing your suggestion does change this to `mov z0.h, h0` but also drastically changes all tests in `sve-intrinsic-opts-cmpne.ll`. I think this optimization is best left for another patch.

LGTM (please address my nit on the test before you land it)

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1443–1448	My suggestion was unrelated to resizing the vector, it was only meant for this loop. But I'm happy with the current version too.
llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
31	Fair enough, I didn't expect that would happen.
137–138	Can you rename this test so that it's clear this is a negative test? (and also maybe add a comment?)

e18b971685fb349299583d95716244f34f974ef8

MattDevereau mentioned this in D141846: [AArch64] Allow poison elements of fixed-vectors to be duplicated as a widened element. Jan 16 2023, 6:31 AM

MattDevereau mentioned this in rG48df06f1d00c: [AArch64] Allow poison elements of fixed-vectors to be duplicated as a widened…. Jan 19 2023, 8:04 AM

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp	82 lines
llvm/test/CodeGen/AArch64/sve-intrinsics-perm-select.ll	52 lines
llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll	207 lines

Diff 488148

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

Unchanged context (earlier lines not shown): instCombineST1ScatterIndex(InstCombiner &IC, IntrinsicInst &II) {
  Type *Ty = Val->getType();

  // Contiguous scatter => masked store.
  // (sve.st1.scatter.index Value Mask BasePtr (sve.index IndexBase 1))
  // => (masked.store Value (gep BasePtr IndexBase) Align Mask)
  Value *IndexBase;
  if (match(Index, m_Intrinsic<Intrinsic::aarch64_sve_index>(
                       m_Value(IndexBase), m_SpecificInt(1)))) {
    IRBuilder<> Builder(II.getContext());
    Builder.SetInsertPoint(&II);

    Align Alignment =
        BasePtr->getPointerAlignment(II.getModule()->getDataLayout());

    Value *Ptr = Builder.CreateGEP(cast<VectorType>(Ty)->getElementType(),
                                   BasePtr, IndexBase);
    Type *VecPtrTy = PointerType::getUnqual(Ty);
    Ptr = Builder.CreateBitCast(Ptr, VecPtrTy);

    (void)Builder.CreateMaskedStore(Val, Ptr, Alignment, Mask);

    return IC.eraseInstFromFunction(II);
  }

  return std::nullopt;
}

static std::optional<Instruction *> instCombineSVESDIV(InstCombiner &IC,
                                                       IntrinsicInst &II) {
  IRBuilder<> Builder(II.getContext());
  Builder.SetInsertPoint(&II);
  Type *Int32Ty = Builder.getInt32Ty();
  Value *Pred = II.getOperand(0);
  Value *Vec = II.getOperand(1);
  Value *DivVec = II.getOperand(2);

  Value *SplatValue = getSplatValue(DivVec);
  ConstantInt *SplatConstantInt = dyn_cast_or_null<ConstantInt>(SplatValue);
  if (!SplatConstantInt)
    return std::nullopt;
  APInt Divisor = SplatConstantInt->getValue();

  if (Divisor.isPowerOf2()) {
    Constant *DivisorLog2 = ConstantInt::get(Int32Ty, Divisor.logBase2());
    auto ASRD = Builder.CreateIntrinsic(
        Intrinsic::aarch64_sve_asrd, {II.getType()}, {Pred, Vec, DivisorLog2});
    return IC.replaceInstUsesWith(II, ASRD);
  }
  if (Divisor.isNegatedPowerOf2()) {
    Divisor.negate();
    Constant *DivisorLog2 = ConstantInt::get(Int32Ty, Divisor.logBase2());
    auto ASRD = Builder.CreateIntrinsic(
        Intrinsic::aarch64_sve_asrd, {II.getType()}, {Pred, Vec, DivisorLog2});
    auto NEG = Builder.CreateIntrinsic(Intrinsic::aarch64_sve_neg,
                                       {ASRD->getType()}, {ASRD, Pred, ASRD});
    return IC.replaceInstUsesWith(II, NEG);
  }

  return std::nullopt;
}

		bool SimplifyValuePattern(SmallVector<Value *> &Vec) {
		sdesmalenUnsubmitted Not Done Reply Inline Actions It's better to take and return an ArrayRef<Value>, rather than a SmallVector object. sdesmalen:* It's better to take and return an ArrayRef<Value*>, rather than a SmallVector object.
		MattDevereauAuthorUnsubmitted Done Reply Inline Actions With the addition of allowing inserts into poison vectors to match patterns, the `SmallVector` now has to be mutable in order to replace poison values. MattDevereau: With the addition of allowing inserts into poison vectors to match patterns, the `SmallVector`…
		sdesmalenUnsubmitted Done Reply Inline Actions It's better to not pass/return large objects by value. If you can't use ArrayRef because you have to modify the values, perhaps you can pass it by reference, and make it both the input and output array (for example by truncating the result) sdesmalen: It's better to not pass/return large objects by value. If you can't use ArrayRef because you…
		MattDevereauAuthorUnsubmitted Done Reply Inline Actions I'm questioning whether this is adding gold plating. The SmallVector's are of pointers to values so it's just the SmallVector objects that are being copied, and this function will recurse at most 15 times. This will also need rewiring to use bool returns as the input and output vectors from this function are compared for equality to check if any pattern was found. It will be beneficial though so I'll have a go at implementing it. MattDevereau: I'm questioning whether this is adding gold plating. The SmallVector's are of pointers to…
		MattDevereauAuthorUnsubmitted Done Reply Inline Actions I've re-written this to now use a single vector passed by reference. MattDevereau: I've re-written this to now use a single vector passed by reference.
		sdesmalenUnsubmitted Not Done Reply Inline Actions Thanks, that looks much better! sdesmalen: Thanks, that looks much better!
		size_t VecSize = Vec.size();
		sdesmalenUnsubmitted Done Reply Inline Actions nit: you can drop the `std::` here. sdesmalen: nit: you can drop the `std::` here.
		sdesmalenUnsubmitted Done Reply Inline Actions You marked this as Done, but it seems unchanged. sdesmalen: You marked this as Done, but it seems unchanged.
		MattDevereauAuthorUnsubmitted Done Reply Inline Actions I've fully replaced references to std::size_t with size_t now MattDevereau: I've fully replaced references to std::size_t with size_t now
		if (VecSize == 1)
		return true;
		if (!isPowerOf2_64(VecSize))
		return false;
		size_t HalfVecSize = VecSize / 2;
		sdesmalenUnsubmitted Not Done Reply Inline Actions if the code is not going to modify the content of the array, it's better to use ArrayRef to avoid creating a new object and copying (existing) data. sdesmalen: if the code is not going to modify the content of the array, it's better to use ArrayRef to…
		MattDevereauAuthorUnsubmitted Done Reply Inline Actions With the addition of allowing inserts into poison vectors to match patterns, the `SmallVector` now has to be mutable in order to replace poison values. MattDevereau: With the addition of allowing inserts into poison vectors to match patterns, the `SmallVector`…

		for (auto LHS = Vec.begin(), RHS = Vec.begin() + HalfVecSize;
		RHS != Vec.end(); LHS++, RHS++) {
		if (LHS != nullptr && RHS != nullptr && LHS == RHS)
		continue;
		sdesmalen (Done): nit: please capitalise Lhs -> LHS and Rhs -> RHS (that's more common in LLVM for these variable names). And feel free to ignore this suggestion, but I personally find this a little easier to read:
		  ArrayRef<Value *> Ref(Vec);
		  auto LHS = Ref.take_front(HalfVecSize);
		  auto RHS = Ref.drop_front(HalfVecSize);
		  for (unsigned I=0; I<HalfVecSize; ++I) {
		    if (LHS[I] != nullptr && RHS[I] != nullptr && LHS[I] == RHS[I])
		      continue;
		(i.e. using ArrayRef and indexing, rather than two changing pointers)
		MattDevereau (Author, Done): I think there are pros and cons to this approach. In your example the setup looks nicer, but I think the ability to resize the vector without passing indices through function parameters looks cleaner and makes the result easier to process. I will stick with the current implementation for now, just to keep the review more compact.
		sdesmalen (Not Done): My suggestion was unrelated to resizing the vector, it was only meant for this loop. But I'm happy with the current version too.
		return false;
		}

		Vec.resize(HalfVecSize);
		sdesmalen (Done): Can this condition be moved to the start of this function?
		MattDevereau (Author, Done): Yes it can be, done.
		SimplifyValuePattern(Vec);
		return true;
		}
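
For readers following the thread, here is a minimal, LLVM-free sketch of the halving idea in `SimplifyValuePattern` as shown above. It is an illustration under assumptions, not the patch itself: `Lane` and `simplifyPattern` are hypothetical names, plain pointers stand in for `Value *`, and `nullptr` marks a lane that was never written.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for llvm::Value *; nullptr marks an undefined lane.
using Lane = const int *;

// Halve Vec for as long as its two halves are lane-for-lane identical and
// fully defined. Returns false if not even one halving was possible
// (a single-element vector counts as already simplified).
bool simplifyPattern(std::vector<Lane> &Vec) {
  std::size_t Size = Vec.size();
  if (Size == 1)
    return true;
  if (Size == 0 || (Size & (Size - 1)) != 0) // size is not a power of two
    return false;

  std::size_t Half = Size / 2;
  for (std::size_t I = 0; I < Half; ++I)
    if (Vec[I] == nullptr || Vec[I] != Vec[I + Half])
      return false;

  Vec.resize(Half);
  // The recursive result is ignored on purpose: one successful halving is
  // already a simplification, and further halvings are a bonus.
  simplifyPattern(Vec);
  return true;
}
```

With four distinct values a, b, c, d, the input (a, b, a, b, a, b, a, b) shrinks to (a, b), while (a, b, c, d, a, b, c, d) stops at (a, b, c, d).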

		// Try to simplify dupqlane patterns like dupqlane(f32 A, f32 B, f32 A, f32 B)
		// to dupqlane(f64(C)) where C is A concatenated with B
		static std::optional<Instruction *> instCombineSVEDupqLane(InstCombiner &IC,
		IntrinsicInst &II) {
		Value *CurrentInsertElt = nullptr, *Default = nullptr;
		if (!match(II.getOperand(0),
		m_Intrinsic<Intrinsic::vector_insert>(
		m_Value(Default), m_Value(CurrentInsertElt), m_Value())) ||
		!isa<FixedVectorType>(CurrentInsertElt->getType()))
		return std::nullopt;
		auto IIScalableTy = cast<ScalableVectorType>(II.getType());

		// Insert the scalars into a container ordered by InsertElement index
		sdesmalen (Not Done): Why is this optimisation limited to integer types?
		MattDevereau (Author, Done): Oops, I ran into problems with integer types on the first pass I did on this patch, but I can see the codegen is quite nice now with a single fmov:
		  fmov s0, w0
		  mov v0.h[1], w1
		  mov v0.h[2], w2
		  mov v0.h[3], w3
		  mov z0.d, d0
		  ret
		for example with `<a, b, c, d>` as i16s. I'll remove this.
		SmallVector<Value *> Elts(IIScalableTy->getMinNumElements(), nullptr);
		sdesmalen (Done): nit: is it worth renaming this to `Elts`? That's a bit simpler.
		while (auto InsertElt = dyn_cast<InsertElementInst>(CurrentInsertElt)) {
		sdesmalen (Done): nit: this can be `auto`, because the type is clear from the `dyn_cast<InsertElementInst>`.
		auto Idx = cast<ConstantInt>(InsertElt->getOperand(2));
		sdesmalen (Done): It might be easier to write this as:
		  Value *CurrentInsertElt = nullptr, *Default = nullptr;
		  if (!match(II, m_Intrinsic<Intrinsic::vector_insert>(m_Value(Default), m_Value(CurrentInsertElt), m_Value())) ||
		      !isa<FixedVectorType>(CurrentInsertElt->getType()))
		as that matches the operands in one go.
		MattDevereau (Author, Done): That looks a bit nicer, done.
		Elts[Idx->getValue().getZExtValue()] = InsertElt->getOperand(1);
		CurrentInsertElt = InsertElt->getOperand(0);
		}

		if (!SimplifyValuePattern(Elts))
		sdesmalen (Not Done): When the loop `break`s here, it will leave the other elements in InsertEltVec as `nullptr`, from which you seem to infer `poison` or `undef` in SimplifyValuePattern. That's not always correct, because the IR could be inserting into some other vector, e.g. `<4 x i32> %arg`, where these values are defined (it would be good to have a test for this).
		MattDevereau (Author, Done): This should now be ok with the inserting into poison logic added. I added the test `dupq_f16_ab_pattern_no_end_indices_not_poison` which I believe should cover this.
		return std::nullopt;
		sdesmalen (Done): I don't think you'll need to check this, since you've already got the check for `*Lhs != nullptr && *Rhs != nullptr` in SimplifyValuePattern, which should always return `false` if any of the lanes is not explicitly defined. And because the size of `InsertEltVec` is based on the minimum number of elements of the scalable vector, it will also return false for the following case:
		  %0 = insertelement <2 x half> poison, half %a, i64 0
		  %1 = insertelement <2 x half> %0, half %b, i64 0
		  %2 = call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v2f16(<vscale x 8 x half> poison, <2 x half> %1, i64 0)
		Because InsertEltVec will be defined as:
		  [%a, %b, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr]
		MattDevereau (Author, Done): Fair enough, removed.

		// Rebuild the simplified chain of InsertElements. e.g. (a, b, a, b) as (a, b)
		IRBuilder<> Builder(II.getContext());
		sdesmalen (Done): Does this need to be a decrementing loop? Or could you write:
		  while (InsertElementInst *InsertElt = dyn_cast<InsertElementInst>(CurrentInsertElt)) {
		    auto Idx = cast<ConstantInt>(InsertElt->getOperand(2));
		    InsertEltVec[Idx->getValue().getZExtValue()] = InsertElt->getOperand(1);
		    CurrentInsertElt = InsertElt->getOperand(0);
		  }
		MattDevereau (Author, Done): It doesn't need to be a decrementing loop. This works well too, so I've put it in.
		Builder.SetInsertPoint(&II);
		Value *InsertEltChain = PoisonValue::get(CurrentInsertElt->getType());
		sdesmalen (Not Done): What happens if there is an insert into the same lane? e.g.
		  %1 = insertelement <8 x half> %0, half %x, i64 1
		  %2 = insertelement <8 x half> %1, half %y, i64 1
		MattDevereau (Author, Done): The latest insert is used (%y in this case) and the transform works ok, I'll add a test for this.
		MattDevereau (Author, Done): I've added the test `dupq_f16_abcd_pattern_double_insert` to `sve-intrinsic-dupqlane.ll` which asserts this behaviour.
		sdesmalen (Not Done): I still think that the code is not correct yet, because for e.g.
		  ...
		  %7 = insertelement <8 x half> %6, half %c, i64 6
		  %8 = insertelement <8 x half> %7, half %c, i64 7
		  %9 = insertelement <8 x half> %8, half %d, i64 7
		  %10 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> %9, i64 0)
		it starts at %9, so it will first store %d to InsertEltVec[7]. Then it looks at %8, and it will store %c to InsertEltVec[7], which means the order is reversed. The reason that your test still passes is that InstCombine has already removed the extra `insertelement` by the time it comes to this function. Perhaps you can just return `false` if the element already exists in InsertEltVec?
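
The ordering concern in this thread can be sketched without LLVM. The following is an illustration under assumptions, not the patch's code: `Insert`, `Unset` and `collectLanes` are hypothetical names, integers stand in for `Value *`, and the chain is listed in program order and walked from the last insert back to the first, as the `while` loop does through operand 0. It bails out when a lane is written twice, which is the reviewer's suggested fix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one insertelement: which lane it writes and what value.
struct Insert {
  std::size_t Index;
  int Value;
};

// Sentinel for a lane that no insert has written (assumes real values are >= 0).
constexpr int Unset = -1;

// Walk Chain from its last insert back to its first and record each lane.
// Returns false if a lane is written more than once or is out of range;
// overwriting blindly would keep the earliest write, not the latest one.
bool collectLanes(const std::vector<Insert> &Chain, std::size_t NumLanes,
                  std::vector<int> &Lanes) {
  Lanes.assign(NumLanes, Unset);
  for (auto It = Chain.rbegin(); It != Chain.rend(); ++It) {
    if (It->Index >= NumLanes || Lanes[It->Index] != Unset)
      return false;
    Lanes[It->Index] = It->Value;
  }
  return true;
}
```

An alternative to bailing out would be to skip lanes that are already filled, so that the latest insert wins; the sketch follows the simpler "return false" suggestion.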
		for (size_t I = 0; I < Elts.size(); I++) {
		InsertEltChain = Builder.CreateInsertElement(InsertEltChain, Elts[I],
		Builder.getInt64(I));
		sdesmalen (Done): If you define `InsertEltChain` as `PoisonValue::get(CurrentInsertElt->getType())`, then you can start the loop at I = 0.
		}

		// Splat the simplified sequence, e.g. (f16 a, f16 b, f16 c, f16 d) as one i64
		sdesmalen (Done): I think you can do just `Builder.getInt64(I)` (and then you don't need the extra variable for it, and can propagate it directly into the expression below, and also drop the curly braces :)
		MattDevereau (Author, Done): This loop no longer looks quite as hideous, thanks :)
		// value or (f16 a, f16 b) as one i32 value. This requires an InsertSubvector
		// be bitcast to a type wide enough to fit the sequence, be splatted, and then
		sdesmalen (Not Done): You'll also need to check the lanes of the original vector being inserted into:
		  %1 = insertelement <4 x float> poison, float %x, i64 0 ; <--- this poison value must be checked instead of the llvm.vector.insert one.
		  %2 = insertelement <4 x float> %1, float %y, i64 1
		  %3 = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v4f32(<vscale x 4 x float> poison, <4 x float> %2, i64 0)
		  %4 = tail call <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float> %3, i64 0)
		However, you'll also need to check the default value of the llvm.vector.insert for this case:
		  %1 = insertelement <2 x float> poison, float %x, i64 0
		  %2 = insertelement <2 x float> %1, float %y, i64 1
		  %3 = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v2f32(<vscale x 4 x float> poison, <2 x float> %2, i64 0) ; the full subvector is defined, but not all lanes in the subvector that's being dupq'ed.
		  %4 = tail call <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float> %3, i64 0)
		MattDevereau (Author, Done): We discussed outside the review that this would ideally be done in a follow-up patch.
		// be narrowed back to the original type.
		unsigned PatternWidth = IIScalableTy->getScalarSizeInBits() * Elts.size();
		unsigned PatternElementCount = IIScalableTy->getScalarSizeInBits() *
		IIScalableTy->getMinNumElements() /
		PatternWidth;

		IntegerType *WideTy = Builder.getIntNTy(PatternWidth);
		sdesmalen (Done): unsigned?
		auto *WideScalableTy = ScalableVectorType::get(WideTy, PatternElementCount);
		sdesmalen (Done): unsigned?
		auto *WideShuffleMaskTy =
		ScalableVectorType::get(Builder.getInt32Ty(), PatternElementCount);
		sdesmalen (Not Done): Can this ever happen? I would expect the values in Pattern to always be defined at this point.
		MattDevereau (Author, Done): It can happen in cases such as `<a, b, c, _, a, b, c, _>`, which I will add a test for.
		MattDevereau (Author, Done): I've added the test `dupq_f16_abcnull_pattern` to `sve-intrinsic-dupqlane.ll` to assert this.

		auto ZeroIdx = ConstantInt::get(Builder.getInt64Ty(), APInt(64, 0));
		sdesmalen (Not Done): At this point we know that Pattern.size() > 1, so can you hoist this out of the loop, and start the loop counter `I` at 1 instead of 0?
		MattDevereau (Author, Done): `Pattern.size() > 1` will be true, but Pattern[0] can be null when the pattern `<_, b, c, d>` is evaluated. The previous revision hoisted this, but some tests were breaking as it was trying to create an InsertElement with a null 0th parameter. I think it makes sense to check it in the loop instead of hoisting it, to make sure you don't do this.
		auto InsertSubvector = Builder.CreateInsertVector(
		II.getType(), PoisonValue::get(II.getType()), InsertEltChain, ZeroIdx);
		auto WideBitcast =
		Builder.CreateBitOrPointerCast(InsertSubvector, WideScalableTy);
		auto WideShuffleMask = ConstantAggregateZero::get(WideShuffleMaskTy);
		sdesmalen (Done): can just as well use `auto` or `Value *` for all these variables here, since their contents don't really matter too much.
		auto WideShuffle = Builder.CreateShuffleVector(
		WideBitcast, PoisonValue::get(WideScalableTy), WideShuffleMask);
		auto NarrowBitcast =
		Builder.CreateBitOrPointerCast(WideShuffle, II.getType());

		return IC.replaceInstUsesWith(II, NarrowBitcast);
		}
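
The widening arithmetic near the end of `instCombineSVEDupqLane` is easy to check by hand. The sketch below reproduces only that arithmetic with plain integers; `WideSplatShape` and `computeWideSplatShape` are hypothetical names introduced for illustration, and the inputs are assumed to satisfy `PatternSize >= 1` with `PatternSize` dividing `MinNumElements`.

```cpp
#include <cassert>

// Hypothetical result type: the bit width of the wide integer element and the
// minimum element count of the wide scalable vector that gets splatted.
struct WideSplatShape {
  unsigned PatternWidth;
  unsigned PatternElementCount;
};

// Mirrors the two expressions in the patch:
//   PatternWidth        = ScalarBits * PatternSize
//   PatternElementCount = ScalarBits * MinNumElements / PatternWidth
WideSplatShape computeWideSplatShape(unsigned ScalarBits,
                                     unsigned MinNumElements,
                                     unsigned PatternSize) {
  unsigned PatternWidth = ScalarBits * PatternSize;
  unsigned PatternElementCount = ScalarBits * MinNumElements / PatternWidth;
  return {PatternWidth, PatternElementCount};
}
```

For `<vscale x 4 x float>` with an (x, y) pattern this gives a 64-bit element repeated 2 times, i.e. `<vscale x 2 x i64>`; for `<vscale x 8 x half>` with an (a, b) pattern it gives `<vscale x 4 x i32>`. Both agree with the bitcasts in the InstCombine tests further down.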

static std::optional<Instruction *> instCombineMaxMinNM(InstCombiner &IC,		static std::optional<Instruction *> instCombineMaxMinNM(InstCombiner &IC,
IntrinsicInst &II) {		IntrinsicInst &II) {
Value *A = II.getArgOperand(0);		Value *A = II.getArgOperand(0);
Value *B = II.getArgOperand(1);		Value *B = II.getArgOperand(1);
if (A == B)		if (A == B)
return IC.replaceInstUsesWith(II, A);		return IC.replaceInstUsesWith(II, A);
		sdesmalen (Done): nit: you can use `auto` here, because it's obvious from the RHS what the type will be.

return std::nullopt;		return std::nullopt;
}		}

static std::optional<Instruction *> instCombineSVESrshl(InstCombiner &IC,		static std::optional<Instruction *> instCombineSVESrshl(InstCombiner &IC,
IntrinsicInst &II) {		IntrinsicInst &II) {
IRBuilder<> Builder(&II);		IRBuilder<> Builder(&II);
Value *Pred = II.getOperand(0);		Value *Pred = II.getOperand(0);
		AArch64TTIImpl::instCombineIntrinsic(InstCombiner &IC,
case Intrinsic::aarch64_sve_st1:		case Intrinsic::aarch64_sve_st1:
return instCombineSVEST1(IC, II, DL);		return instCombineSVEST1(IC, II, DL);
case Intrinsic::aarch64_sve_sdiv:		case Intrinsic::aarch64_sve_sdiv:
return instCombineSVESDIV(IC, II);		return instCombineSVESDIV(IC, II);
case Intrinsic::aarch64_sve_sel:		case Intrinsic::aarch64_sve_sel:
return instCombineSVESel(IC, II);		return instCombineSVESel(IC, II);
case Intrinsic::aarch64_sve_srshl:		case Intrinsic::aarch64_sve_srshl:
return instCombineSVESrshl(IC, II);		return instCombineSVESrshl(IC, II);
		case Intrinsic::aarch64_sve_dupq_lane:
		return instCombineSVEDupqLane(IC, II);
}		}

return std::nullopt;		return std::nullopt;
}		}

std::optional<Value *> AArch64TTIImpl::simplifyDemandedVectorEltsIntrinsic(		std::optional<Value *> AArch64TTIImpl::simplifyDemandedVectorEltsIntrinsic(
InstCombiner &IC, IntrinsicInst &II, APInt OrigDemandedElts,		InstCombiner &IC, IntrinsicInst &II, APInt OrigDemandedElts,
APInt &UndefElts, APInt &UndefElts2, APInt &UndefElts3,		APInt &UndefElts, APInt &UndefElts2, APInt &UndefElts3,

llvm/test/CodeGen/AArch64/sve-intrinsics-perm-select.ll

	}			}
	;			;
	; EXT			; EXT
	;			;

	define dso_local <vscale x 4 x float> @dupq_f32_repeat_complex(float %x, float %y) {			define dso_local <vscale x 4 x float> @dupq_f32_repeat_complex(float %x, float %y) {
	; CHECK-LABEL: dupq_f32_repeat_complex:			; CHECK-LABEL: dupq_f32_repeat_complex:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: // kill: def $s0 killed $s0 def $q0			; CHECK-NEXT: // kill: def $s0 killed $s0 def $z0
	; CHECK-NEXT: mov v2.16b, v0.16b
	; CHECK-NEXT: // kill: def $s1 killed $s1 def $q1			; CHECK-NEXT: // kill: def $s1 killed $s1 def $q1
	; CHECK-NEXT: mov v2.s[1], v1.s[0]			; CHECK-NEXT: mov v0.s[1], v1.s[0]
	; CHECK-NEXT: mov v2.s[2], v0.s[0]			; CHECK-NEXT: mov z0.d, d0
	; CHECK-NEXT: mov v2.s[3], v1.s[0]
	; CHECK-NEXT: mov z0.q, q2
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%1 = insertelement <4 x float> undef, float %x, i64 0			%1 = insertelement <4 x float> undef, float %x, i64 0
	%2 = insertelement <4 x float> %1, float %y, i64 1			%2 = insertelement <4 x float> %1, float %y, i64 1
	%3 = insertelement <4 x float> %2, float %x, i64 2			%3 = call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v4f32(<vscale x 4 x float> undef, <4 x float> %2, i64 0)
	%4 = insertelement <4 x float> %3, float %y, i64 3			%4 = bitcast <vscale x 4 x float> %3 to <vscale x 2 x double>
	%5 = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v4f32(<vscale x 4 x float> undef, <4 x float> %4, i64 0)			%5 = shufflevector <vscale x 2 x double> %4, <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer
	%6 = tail call <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float> %5, i64 0)			%6 = bitcast <vscale x 2 x double> %5 to <vscale x 4 x float>
	ret <vscale x 4 x float> %6			ret <vscale x 4 x float> %6
	}			}

	define dso_local <vscale x 8 x half> @dupq_f16_repeat_complex(half %a, half %b) {			define dso_local <vscale x 8 x half> @dupq_f16_repeat_complex(half %x, half %y) {
	; CHECK-LABEL: dupq_f16_repeat_complex:			; CHECK-LABEL: dupq_f16_repeat_complex:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: // kill: def $h0 killed $h0 def $q0			; CHECK-NEXT: // kill: def $h0 killed $h0 def $z0
	; CHECK-NEXT: mov v2.16b, v0.16b
	; CHECK-NEXT: // kill: def $h1 killed $h1 def $q1			; CHECK-NEXT: // kill: def $h1 killed $h1 def $q1
	; CHECK-NEXT: mov v2.h[1], v1.h[0]			; CHECK-NEXT: mov v0.h[1], v1.h[0]
	; CHECK-NEXT: mov v2.h[2], v0.h[0]			; CHECK-NEXT: mov z0.s, s0
	; CHECK-NEXT: mov v2.h[3], v1.h[0]			; CHECK-NEXT: ret
	; CHECK-NEXT: mov v2.h[4], v0.h[0]			%1 = insertelement <8 x half> undef, half %x, i64 0
	; CHECK-NEXT: mov v2.h[5], v1.h[0]			%2 = insertelement <8 x half> %1, half %y, i64 1
	; CHECK-NEXT: mov v2.h[6], v0.h[0]			%3 = call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> undef, <8 x half> %2, i64 0)
	; CHECK-NEXT: mov v2.h[7], v1.h[0]			%4 = bitcast <vscale x 8 x half> %3 to <vscale x 4 x float>
	; CHECK-NEXT: mov z0.q, q2			%5 = shufflevector <vscale x 4 x float> %4, <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: ret			%6 = bitcast <vscale x 4 x float> %5 to <vscale x 8 x half>
	%1 = insertelement <8 x half> undef, half %a, i64 0			ret <vscale x 8 x half> %6
	%2 = insertelement <8 x half> %1, half %b, i64 1
	%3 = insertelement <8 x half> %2, half %a, i64 2
	%4 = insertelement <8 x half> %3, half %b, i64 3
	%5 = insertelement <8 x half> %4, half %a, i64 4
	%6 = insertelement <8 x half> %5, half %b, i64 5
	%7 = insertelement <8 x half> %6, half %a, i64 6
	%8 = insertelement <8 x half> %7, half %b, i64 7
	%9 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> undef, <8 x half> %8, i64 0)
	%10 = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %9, i64 0)
	ret <vscale x 8 x half> %10
	}			}

	define <vscale x 16 x i8> @ext_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) {			define <vscale x 16 x i8> @ext_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) {
	; CHECK-LABEL: ext_i8:			; CHECK-LABEL: ext_i8:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: ext z0.b, z0.b, z1.b, #255			; CHECK-NEXT: ext z0.b, z0.b, z1.b, #255
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%out = call <vscale x 16 x i8> @llvm.aarch64.sve.ext.nxv16i8(<vscale x 16 x i8> %a,			%out = call <vscale x 16 x i8> @llvm.aarch64.sve.ext.nxv16i8(<vscale x 16 x i8> %a,

llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -S -passes=instcombine < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				define dso_local <vscale x 4 x float> @dupq_f32_ab_pattern(float %x, float %y) {
				; CHECK-LABEL: @dupq_f32_ab_pattern(
				; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x float> poison, float [[X:%.*]], i64 0
				; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x float> [[TMP1]], float [[Y:%.*]], i64 1
				; CHECK-NEXT: [[TMP3:%.*]] = call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v4f32(<vscale x 4 x float> poison, <4 x float> [[TMP2]], i64 0)
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast <vscale x 4 x float> [[TMP3]] to <vscale x 2 x i64>
				; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <vscale x 2 x i64> [[TMP4]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast <vscale x 2 x i64> [[TMP5]] to <vscale x 4 x float>
				; CHECK-NEXT: ret <vscale x 4 x float> [[TMP6]]
				;
				%1 = insertelement <4 x float> poison, float %x, i64 0
				%2 = insertelement <4 x float> %1, float %y, i64 1
				%3 = insertelement <4 x float> %2, float %x, i64 2
				%4 = insertelement <4 x float> %3, float %y, i64 3
				%5 = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v4f32(<vscale x 4 x float> poison, <4 x float> %4, i64 0)
				%6 = tail call <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float> %5, i64 0)
				ret <vscale x 4 x float> %6
				}

				define dso_local <vscale x 8 x half> @dupq_f16_a_pattern(half %a) {
				; CHECK-LABEL: @dupq_f16_a_pattern(
				; CHECK-NEXT: [[TMP1:%.*]] = insertelement <8 x half> poison, half [[A:%.*]], i64 0
				; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <8 x half> [[TMP1]], <8 x half> poison, <8 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP3:%.*]] = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> [[TMP2]], i64 0)
				; CHECK-NEXT: [[TMP4:%.*]] = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> [[TMP3]], i64 0)
				; CHECK-NEXT: ret <vscale x 8 x half> [[TMP4]]
				sdesmalen (Done): It's a pity this case isn't optimised. This is probably because instcombine has transformed it into a 'shufflevector' splat. You could do something like this:
				  if (Value *V = getSplatValue(CurrentInsertElt)) {
				    InsertEltVec[0] = V;
				    InsertEltVec.resize(1);
				  } else {
				    while (InsertElementInst *InsertElt = dyn_cast<InsertElementInst>(CurrentInsertElt)) {
				      ...
				    }
				to handle it.
				MattDevereau (Author, Done): I think the assembly output ends up as just a single MOV instruction and there isn't much to gain here (I'll double check this). The thinking behind this test was more just to prove the recursion can indeed get the smallest possible pattern `a`.
				MattDevereau (Author, Done): This case currently gets emitted as a neon mov:
				  dup v0.8h, v0.h[0]
				  mov z0.q, q0
				  ret
				Implementing your suggestion does change this to `mov z0.h, h0`, but it also drastically changes all tests in `sve-intrinsic-opts-cmpne.ll`. I think this optimization is best left for another patch.
				sdesmalen (Not Done): Fair enough, I didn't expect that would happen.
				;
				%1 = insertelement <8 x half> poison, half %a, i64 0
				%2 = insertelement <8 x half> %1, half %a, i64 1
				%3 = insertelement <8 x half> %2, half %a, i64 2
				%4 = insertelement <8 x half> %3, half %a, i64 3
				%5 = insertelement <8 x half> %4, half %a, i64 4
				%6 = insertelement <8 x half> %5, half %a, i64 5
				%7 = insertelement <8 x half> %6, half %a, i64 6
				%8 = insertelement <8 x half> %7, half %a, i64 7
				%9 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> %8, i64 0)
				%10 = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %9, i64 0)
				ret <vscale x 8 x half> %10
				}

				define dso_local <vscale x 8 x half> @dupq_f16_ab_pattern(half %a, half %b) {
				; CHECK-LABEL: @dupq_f16_ab_pattern(
				; CHECK-NEXT: [[TMP1:%.*]] = insertelement <8 x half> poison, half [[A:%.*]], i64 0
				; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x half> [[TMP1]], half [[B:%.*]], i64 1
				; CHECK-NEXT: [[TMP3:%.*]] = call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> [[TMP2]], i64 0)
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast <vscale x 8 x half> [[TMP3]] to <vscale x 4 x i32>
				; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <vscale x 4 x i32> [[TMP4]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast <vscale x 4 x i32> [[TMP5]] to <vscale x 8 x half>
				; CHECK-NEXT: ret <vscale x 8 x half> [[TMP6]]
				;
				%1 = insertelement <8 x half> poison, half %a, i64 0
				%2 = insertelement <8 x half> %1, half %b, i64 1
				%3 = insertelement <8 x half> %2, half %a, i64 2
				%4 = insertelement <8 x half> %3, half %b, i64 3
				%5 = insertelement <8 x half> %4, half %a, i64 4
				%6 = insertelement <8 x half> %5, half %b, i64 5
				%7 = insertelement <8 x half> %6, half %a, i64 6
				%8 = insertelement <8 x half> %7, half %b, i64 7
				%9 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> %8, i64 0)
				%10 = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %9, i64 0)
				ret <vscale x 8 x half> %10
				}

				define dso_local <vscale x 8 x half> @dupq_f16_abcd_pattern(half %a, half %b, half %c, half %d) {
				; CHECK-LABEL: @dupq_f16_abcd_pattern(
				; CHECK-NEXT: [[TMP1:%.*]] = insertelement <8 x half> poison, half [[A:%.*]], i64 0
				; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x half> [[TMP1]], half [[B:%.*]], i64 1
				; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x half> [[TMP2]], half [[C:%.*]], i64 2
				; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x half> [[TMP3]], half [[D:%.*]], i64 3
				; CHECK-NEXT: [[TMP5:%.*]] = call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> [[TMP4]], i64 0)
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast <vscale x 8 x half> [[TMP5]] to <vscale x 2 x i64>
				; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <vscale x 2 x i64> [[TMP6]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP8:%.*]] = bitcast <vscale x 2 x i64> [[TMP7]] to <vscale x 8 x half>
				; CHECK-NEXT: ret <vscale x 8 x half> [[TMP8]]
				;
				%1 = insertelement <8 x half> poison, half %a, i64 0
				%2 = insertelement <8 x half> %1, half %b, i64 1
				%3 = insertelement <8 x half> %2, half %c, i64 2
				%4 = insertelement <8 x half> %3, half %d, i64 3
				%5 = insertelement <8 x half> %4, half %a, i64 4
				%6 = insertelement <8 x half> %5, half %b, i64 5
				%7 = insertelement <8 x half> %6, half %c, i64 6
				sdesmalen (Not Done): This case could still work, right? If the two elements that are missing are both undef, they could be anything, including `a, b`.
				MattDevereau (Author, Done): The case can definitely work; however, when re-implementing this patch as a recursive algorithm I ran into a few headaches when trying to integrate null pointers/poison values into the recursion. It is entirely possible to do this, but if there is minimal gain for the time required I'd suggest it be done as a separate patch. Some cases I ran into issues with were how to handle patterns such as `<a, b, c, nullptr, a, b, c, d>`, where nullptr represents poison elements. Logically you'd want to pick the right half as a pattern as it has no undefined values, but things start getting complicated with cases such as `<a, b, nullptr, nullptr, nullptr, nullptr, nullptr, d>` and `<a, nullptr, a, nullptr, nullptr, b, nullptr, b>`. It should be possible to simplify these, but I suspect it would be easier to write a separate algorithm from what I've done here to handle poison cases.
				sdesmalen (Done): I wouldn't expect the undef/poison case to be that difficult to handle. E.g. for `<a, b, _, _, _, _, _, d>` it could compare `<a, b, _, _>` with `<_, _, _, d>`, which would match the most specific values `<a, b, _, d>`. For `<a, _, a, _, _, b, _, b>` it could compare `<a, _, a, _>` with `<_, b, _, b>`, which would match `<a, b, a, b>`, which could then be further broken down to `<a, b>`.
				MattDevereau (Author, Done): I had an "aha" moment when reading this comment. In the previous diff revision I was doing the recursive algorithm from the bottom up, i.e. splitting the vectors into two, passing them down the recursion chain, doing the comparison logic, and then passing it back up the recursion chain for further comparison, which was causing a lot of strife with more complicated patterns and poison values. Instead it is much simpler to approach it from a top-down angle, where the nullptr/poison/_ values get replaced immediately and the recursion call is in the return statement. The tests `dupq_f16_ab_pattern_no_front_indices`, `dupq_f16_ab_pattern_no_end_indices` and `dupq_f16_ab_pattern_no_end_indices` should show this in action.
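
The top-down merge described in this thread can be sketched as follows. This is a rough illustration of the reviewer's suggestion under assumptions, not the code in the diff: `Lane` and `simplifyWithWildcards` are hypothetical names, plain pointers stand in for `Value *`, and `nullptr` plays the role of `_` (an undefined lane that may take any value).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for llvm::Value *; nullptr is the wildcard lane "_".
using Lane = const int *;

// Top-down halving that lets an undefined lane match anything. The two halves
// are merged into the most specific pattern before recursing on the result.
bool simplifyWithWildcards(std::vector<Lane> &Vec) {
  std::size_t Size = Vec.size();
  if (Size == 1)
    return true;
  if (Size == 0 || (Size & (Size - 1)) != 0) // size is not a power of two
    return false;

  std::size_t Half = Size / 2;
  // Check compatibility first so Vec is left untouched on failure.
  for (std::size_t I = 0; I < Half; ++I)
    if (Vec[I] && Vec[I + Half] && Vec[I] != Vec[I + Half])
      return false;

  // Replace wildcards in the front half with the value from the back half.
  for (std::size_t I = 0; I < Half; ++I)
    if (!Vec[I])
      Vec[I] = Vec[I + Half];

  Vec.resize(Half);
  simplifyWithWildcards(Vec); // further halving is optional
  return true;
}
```

On the two examples from the thread, `<a, b, _, _, _, _, _, d>` becomes `<a, b, _, d>` and `<a, _, a, _, _, b, _, b>` becomes `<a, b>`. A real implementation would still need to decide what to do with lanes that remain undefined after merging.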
				%8 = insertelement <8 x half> %7, half %d, i64 7
				%9 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> %8, i64 0)
				%10 = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %9, i64 0)
				ret <vscale x 8 x half> %10
				}

				define dso_local <vscale x 8 x half> @dupq_f16_abcnull_pattern(half %a, half %b, half %c, half %d) {
				; CHECK-LABEL: @dupq_f16_abcnull_pattern(
				sdesmalen (Done): Is this test much different from `@dupq_f16_ab_pattern_no_end_indices`?
				MattDevereau (Author, Done): This test would splat `<a, b, c, _>` as a 64-bit value, whereas `@dupq_f16_ab_pattern_no_end_indices` would splat `<a, b>` as a 32-bit value, which could check that the algorithm can distinguish between the two. However, I suppose that difference is ultimately not so important, so I'll remove the indices tests.
				; CHECK-NEXT: [[TMP1:%.*]] = insertelement <8 x half> poison, half [[A:%.*]], i64 0
				; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x half> [[TMP1]], half [[B:%.*]], i64 1
				; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x half> [[TMP2]], half [[C:%.*]], i64 2
				; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x half> [[TMP3]], half [[A]], i64 4
				; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x half> [[TMP4]], half [[B]], i64 5
				; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x half> [[TMP5]], half [[C]], i64 6
				; CHECK-NEXT: [[TMP7:%.*]] = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> [[TMP6]], i64 0)
				; CHECK-NEXT: [[TMP8:%.*]] = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> [[TMP7]], i64 0)
				; CHECK-NEXT: ret <vscale x 8 x half> [[TMP8]]
				;
				%1 = insertelement <8 x half> poison, half %a, i64 0
				%2 = insertelement <8 x half> %1, half %b, i64 1
				%3 = insertelement <8 x half> %2, half %c, i64 2
				%4 = insertelement <8 x half> %3, half %a, i64 4
				%5 = insertelement <8 x half> %4, half %b, i64 5
				%6 = insertelement <8 x half> %5, half %c, i64 6
				%7 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> %6, i64 0)
				%8 = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %7, i64 0)
				ret <vscale x 8 x half> %8
				}

				define dso_local <vscale x 8 x half> @dupq_f16_abcd_pattern_double_insert(half %a, half %b, half %c, half %d) {
				; CHECK-LABEL: @dupq_f16_abcd_pattern_double_insert(
				; CHECK-NEXT: [[TMP1:%.*]] = insertelement <8 x half> poison, half [[A:%.*]], i64 0
				; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x half> [[TMP1]], half [[B:%.*]], i64 1
				; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x half> [[TMP2]], half [[C:%.*]], i64 2
				; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x half> [[TMP3]], half [[D:%.*]], i64 3
				; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x half> [[TMP4]], half [[A]], i64 4
				; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x half> [[TMP5]], half [[B]], i64 5
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x half> [[TMP6]], half [[C]], i64 6
				; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x half> [[TMP7]], half [[C]], i64 7
				; CHECK-NEXT: [[TMP9:%.*]] = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> [[TMP8]], i64 0)
				; CHECK-NEXT: [[TMP10:%.*]] = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> [[TMP9]], i64 0)
				; CHECK-NEXT: ret <vscale x 8 x half> [[TMP10]]
				;
				%1 = insertelement <8 x half> poison, half %a, i64 0
				%2 = insertelement <8 x half> %1, half %b, i64 1
				%3 = insertelement <8 x half> %2, half %c, i64 2
				%4 = insertelement <8 x half> %3, half %d, i64 3
				%5 = insertelement <8 x half> %4, half %a, i64 4
				%6 = insertelement <8 x half> %5, half %b, i64 5
				%7 = insertelement <8 x half> %6, half %c, i64 6
				%8 = insertelement <8 x half> %7, half %d, i64 7
				sdesmalen [Done]: If you swap `%c` and `%d`, this at least becomes a negative test that ensures LLVM doesn't perform the wrong optimisation. That's better than a positive test that just happens to do the right thing because another InstCombine rule fired first, even though your code would have optimised this incorrectly.
				sdesmalen [Not Done]: Can you rename this test so that it's clear this is a negative test? (And maybe also add a comment?)
				%9 = insertelement <8 x half> %8, half %c, i64 7
				%10 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> %9, i64 0)
				%11 = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %10, i64 0)
				ret <vscale x 8 x half> %11
				}

				define dso_local <vscale x 8 x half> @dupq_f16_abcd_pattern_reverted_insert(half %a, half %b, half %c, half %d) {
				; CHECK-LABEL: @dupq_f16_abcd_pattern_reverted_insert(
				; CHECK-NEXT: [[TMP1:%.*]] = insertelement <8 x half> poison, half [[A:%.*]], i64 0
				; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x half> [[TMP1]], half [[B:%.*]], i64 1
				; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x half> [[TMP2]], half [[C:%.*]], i64 2
				; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x half> [[TMP3]], half [[D:%.*]], i64 3
				; CHECK-NEXT: [[TMP5:%.*]] = call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> [[TMP4]], i64 0)
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast <vscale x 8 x half> [[TMP5]] to <vscale x 2 x i64>
				; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <vscale x 2 x i64> [[TMP6]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP8:%.*]] = bitcast <vscale x 2 x i64> [[TMP7]] to <vscale x 8 x half>
				; CHECK-NEXT: ret <vscale x 8 x half> [[TMP8]]
				;
				%1 = insertelement <8 x half> poison, half %d, i64 7
				%2 = insertelement <8 x half> %1, half %c, i64 6
				%3 = insertelement <8 x half> %2, half %b, i64 5
				%4 = insertelement <8 x half> %3, half %a, i64 4
				%5 = insertelement <8 x half> %4, half %d, i64 3
				%6 = insertelement <8 x half> %5, half %c, i64 2
				%7 = insertelement <8 x half> %6, half %b, i64 1
				%8 = insertelement <8 x half> %7, half %a, i64 0
				%9 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> %8, i64 0)
				%10 = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %9, i64 0)
				ret <vscale x 8 x half> %10
				}
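
The CHECK lines of `@dupq_f16_abcd_pattern_reverted_insert` above show the four-`half` pattern being bitcast to `<vscale x 2 x i64>` and splatted with a `zeroinitializer` mask. The width arithmetic behind that choice can be sketched as below. This is an illustration only, not code from the patch: `splatTypeFor` is a hypothetical helper, and it assumes the pattern must evenly divide the 128-bit quadword that `dupq.lane` repeats.

```cpp
#include <optional>
#include <string>

// Hypothetical helper (not part of the patch): given the element width in bits
// and the length of the simplified repeating pattern, return the scalable
// integer vector type whose single element covers the whole pattern.
// Returns std::nullopt when the pattern does not shrink the 128-bit quadword.
std::optional<std::string> splatTypeFor(unsigned elemBits, unsigned patternLen) {
  constexpr unsigned kQuadwordBits = 128; // dupq.lane repeats one 128-bit block
  const unsigned patternBits = elemBits * patternLen;
  if (patternBits == 0 || patternBits >= kQuadwordBits ||
      kQuadwordBits % patternBits != 0)
    return std::nullopt;
  // Number of pattern-sized elements in the minimum (128-bit) vector.
  const unsigned minElts = kQuadwordBits / patternBits;
  return "<vscale x " + std::to_string(minElts) + " x i" +
         std::to_string(patternBits) + ">";
}
```

Under these assumptions, four `half` lanes (`a, b, c, d`) give a 64-bit pattern and `<vscale x 2 x i64>`, matching the bitcast in the CHECK lines, while two `half` lanes (`a, b`) would give `<vscale x 4 x i32>`.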

				define dso_local <vscale x 8 x half> @dupq_f16_ab_no_front_pattern(half %a, half %b) {
				; CHECK-LABEL: @dupq_f16_ab_no_front_pattern(
				; CHECK-NEXT: [[TMP1:%.*]] = insertelement <8 x half> poison, half [[A:%.*]], i64 0
				; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x half> [[TMP1]], half [[A]], i64 1
				; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x half> [[TMP2]], half [[A]], i64 2
				; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x half> [[TMP3]], half [[B:%.*]], i64 3
				; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x half> [[TMP4]], half [[A]], i64 4
				; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x half> [[TMP5]], half [[B]], i64 5
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x half> [[TMP6]], half [[A]], i64 6
				; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x half> [[TMP7]], half [[B]], i64 7
				; CHECK-NEXT: [[TMP9:%.*]] = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> [[TMP8]], i64 0)
				; CHECK-NEXT: [[TMP10:%.*]] = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> [[TMP9]], i64 0)
				; CHECK-NEXT: ret <vscale x 8 x half> [[TMP10]]
				;
				%1 = insertelement <8 x half> poison, half %a, i64 0
				%2 = insertelement <8 x half> %1, half %a, i64 1
				%3 = insertelement <8 x half> %2, half %a, i64 2
				%4 = insertelement <8 x half> %3, half %b, i64 3
				%5 = insertelement <8 x half> %4, half %a, i64 4
				%6 = insertelement <8 x half> %5, half %b, i64 5
				%7 = insertelement <8 x half> %6, half %a, i64 6
				%8 = insertelement <8 x half> %7, half %b, i64 7
				%9 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> %8, i64 0)
				%10 = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %9, i64 0)
				ret <vscale x 8 x half> %10
				}

				declare <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half>, <8 x half>, i64)
				declare <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half>, i64)
				declare <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v4f32(<vscale x 4 x float>, <4 x float>, i64)
				declare <vscale x 4 x float> @llvm.vector.insert.nxv2f32.v2f32(<vscale x 4 x float>, <2 x float>, i64)
				declare <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float>, i64)
				declare <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.v4i32(<vscale x 4 x i32>, <4 x i32>, i64)
				declare <vscale x 4 x i32> @llvm.aarch64.sve.dupq.lane.nxv4i32(<vscale x 4 x i32>, i64)
				declare <vscale x 8 x i16> @llvm.vector.insert.nxv8i16.v8i16(<vscale x 8 x i16>, <8 x i16>, i64)
				declare <vscale x 8 x i16> @llvm.aarch64.sve.dupq.lane.nxv8i16(<vscale x 8 x i16>, i64)

				attributes #0 = { "target-features"="+sve" }
				sdesmalen [Done]: Given the algorithm just compares two halves of the vector, I'm not sure there's much value in having separate negative tests for "no_front_pattern", "no_middle_pattern" and "no_end_pattern".
				MattDevereau (Author) [Done]: These tests don't mean much without the poison value insertions now, so I'll remove them.
				sdesmalen [Done]: "No end indices not poison"? (Perhaps there is a double negative here that confuses me?) In any case, the end indices are poison, because the test is inserting into `poison`.
				MattDevereau (Author) [Done]: Some of these tests that appear redundant are left over from previous revisions, where allowing inserts of poison values changed things. The first negative refers to indices 6 and 7 being missing from the `insertelement` chain. The second negative is because the first parameter of the vector insert is a non-poison value `c`, unlike the other tests, where it is always poison.
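
The top-down approach described in the inline comments (compare the two halves of the vector, fill missing lanes from the other half immediately, then try to halve again) can be sketched roughly as follows. This is a standalone illustration rather than the code from the patch: `Lane` and `simplifyPattern` are hypothetical names, lanes are modelled as characters instead of IR values, and `std::nullopt` stands in for a lane that was never inserted (poison).

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for an IR value: a named lane, or std::nullopt for a
// lane that was never inserted (poison).
using Lane = std::optional<char>;

// If the second half of `vec` repeats the first half, shrink `vec` to the
// shortest repeating prefix found and return true. Return false if the two
// halves conflict; in that case the caller should discard `vec`.
bool simplifyPattern(std::vector<Lane> &vec, bool allowPoison) {
  const std::size_t n = vec.size();
  if (n == 1)
    return true;
  if (n == 0 || (n & (n - 1)) != 0)
    return false; // only power-of-two lengths can be halved repeatedly

  const std::size_t half = n / 2;
  for (std::size_t i = 0; i < half; ++i) {
    Lane &lhs = vec[i];
    const Lane &rhs = vec[i + half];
    if (lhs && rhs) {
      if (*lhs != *rhs)
        return false; // the halves disagree, so there is no repetition
      continue;
    }
    if (!allowPoison)
      return false;
    if (!lhs && rhs)
      lhs = rhs; // fill the gap from the other half straight away
  }

  vec.resize(half);
  // Try to halve again; a failure here just means `vec` is already minimal.
  simplifyPattern(vec, allowPoison);
  return true;
}
```

With this sketch, `<a, b, a, b, a, b, _, _>` reduces to `<a, b>` when poison lanes are allowed, `<a, b, c, _, a, b, c, _>` reduces to `<a, b, c, _>`, and `<a, b, c, d, a, b, c, c>` is rejected, which mirrors the distinction the reviewers discuss between the 32-bit and 64-bit splats and the negative test.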

Revision Contents

Diff 488148

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

llvm/test/CodeGen/AArch64/sve-intrinsics-perm-select.ll

llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
