This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][InstCombine] Simplify repeated complex patterns in dupqlane
ClosedPublic

Authored by MattDevereau on Nov 17 2022, 5:35 AM.

Details

Summary

[AArch64][InstCombine] Simplify repeated complex patterns in dupqlane

Repeated floating-point complex patterns in dupqlane such as (f32 a, f32 b, f32 a, f32 b) can be simplified to shufflevector(f64(a, b), undef, 0)

Diff Detail

Event Timeline

MattDevereau created this revision.Nov 17 2022, 5:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 17 2022, 5:35 AM
MattDevereau requested review of this revision.Nov 17 2022, 5:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 17 2022, 5:35 AM

This is essentially an InstCombine version of the work done in https://reviews.llvm.org/D133116.

nlopes added inline comments.Nov 17 2022, 5:42 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1435

Please use PoisoValue here and whenever possible. We are trying to remove undef from LLVM.
Actually, here you can even use a different API: CreateInsertElement(Type *VecTy, Value *NewElt, Value *Idx)

Thank you!

Matt added a subscriber: Matt.Nov 17 2022, 8:45 AM
peterwaller-arm accepted this revision.Dec 13 2022, 1:26 AM

LGTM, but please address the comment about poison and allow time for other reviewers to chime in.

This revision is now accepted and ready to land.Dec 13 2022, 1:26 AM
tschuett added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1384

Please use std::optional. The LLVM variant is going away.

1428

Please use std::nullopt.

Could you maybe update the commit message, it seems to have the title twice :)

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1390

nit: VecTy->getScalarType() can be propagated into the few uses below.

1400–1402

nit: This can be moved to the start of the function, so that it bails out immediately if the conditions are not as expected.

1401

Should it also bail out if the vector being inserted is not a FixedLengthVector?

1408

nit: LLVM's coding style uses capitalised variable names, so I in this case.

1408

Does this algorithm work if the vector being inserted is not a power-of-two vector? (I guess when it is inserting into an 'undef' vector, you could still consider the other values to be anything you like and the algorithm may still work).

Could you also add a test-case for this?

1425–1428

This is a little bit restrictive, because it could also work for e.g.

<16 x i8> <a, b, c, d, a, b, c, d, a, b, c, d, a, b, c, d>

where only <a, b, c, d> would need to be splat.

It might help instead to find the 'minimum' set by recursively halving the vector and seeing if all elements match. e.g.

<a, b, a, b, a, b, a, b>
=> <a, b, a, b> == <a, b, a, b>
=> <a, b> == <a, b>

so that the minimum set to splat is <a, b>

llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
86

This case could still work right? If the two elements that are missing are both undef, they could be anything including a, b.

MattDevereau added inline comments.Jan 3 2023, 8:19 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1425–1428

I've implemented a recursive function which now handles <a, b, c, d, a, b, c, d>

llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
86

The case can definitely work, however when re-implenting this patch as a recursive algorithm I ran into a few headaches when trying to integrate null pointers/poision values into the recursion. It is entirely possible to do this, however if there is minimal gain for the time required this I'd suggest it be done as a separate patch.

Some cases I ran into issues with were how to handle cases such as:
<a, b, c, nullptr, a, b, c, d>, where nullptr respresents poison elements. Logically you'd want to pick the right half as a pattern as it has no undefined values, but things start getting complicated with cases such as <a, b, nullptr, nullptr, nullptr, nullptr, nullptr, d>, and <a, nullptr, a, nullptr, nullptr, b, nullptr, b> It should be possible to simplify these, however I suspect it would be easier to write a separate algorithm from what I've done here to handle poison cases.

sdesmalen added inline comments.Jan 4 2023, 3:21 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1437

It's better to take and return an ArrayRef<Value*>, rather than a SmallVector object.

1438

nit: you can drop the std:: here.

1443

if the code is not going to modify the content of the array, it's better to use ArrayRef to avoid creating a new object and copying (existing) data.

1476–1477

When the loop breaks here, it will leave the other elements in InsertEltVec as nullptr from which you seem to infer poison or undef in SimplifyValuePattern.
That's not always correct, because the IR could be inserting into some other vector, e.g. <4 x i32> %arg where these values are defined (it would be good to have a test for this)

llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
86

I wouldn't expect the undef/poison case to be that difficult to handle.

E.g. for <a, b, _, _, _, _, _, d> it could compare <a, b, _, _> with <_, _, _, d>, which would match the most specific values <a, b, _, d>.
For <a, _, a, _, _, b, _, b> it could compare <a, _, a, _> with <_, b, _, b> which would match <a, b, a, b>, which could then be further broken down to <a, b>.

MattDevereau marked 2 inline comments as done.Jan 4 2023, 8:58 AM
MattDevereau added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1437

With the addition of allowing inserts into poison vectors to match patterns, the SmallVector now has to be mutable in order to replace poison values.

1443

With the addition of allowing inserts into poison vectors to match patterns, the SmallVector now has to be mutable in order to replace poison values.

1476–1477

This should now be ok with the inserting into poison logic added. I added the test dupq_f16_ab_pattern_no_end_indices_not_poison which I believe should cover this.

llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
86

I had an "aha" moment when reading this comment. In the previous diff revision I was doing the recursive algorithm from the bottom up - i.e. splitting the vectors into two, passing them down the recursion chain, doing the comparison logic, and then passing it back up the recursion chain for further comparison which was causing a lot of strife with more complicated patterns and poison values. Instead it is much simpler to approach it from a top-down angle where the nullptr/poison/_ values get replaced immediately and the recursion call is in the return statement. The tests dupq_f16_ab_pattern_no_front_indices, dupq_f16_ab_pattern_no_end_indices and dupq_f16_ab_pattern_no_end_indices should show this in action.

sdesmalen added inline comments.Jan 5 2023, 8:23 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1437

It's better to not pass/return large objects by value. If you can't use ArrayRef because you have to modify the values, perhaps you can pass it by reference, and make it both the input and output array (for example by truncating the result)

1438

You marked this as Done, but it seems unchanged.

1483

What happens if there is an insert into the same lane?

e.g.

%1 = insertelement <8 x half> %0, half %x, i64 1
%2 = insertelement <8 x half> %1, half %y, i64 1
1500–1501

Can this ever happen? I would expect the values in Pattern to always be defined at this point.

1503

At this point we know the Pattern.size() > 1, so can you hoist this out of the loop, and start the loop counter I at 1 instead of 0?

1522

nit: you can use auto here, because it's obvious from the RHS what the type will be

MattDevereau marked an inline comment as done.Jan 5 2023, 9:22 AM
MattDevereau added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1437

I'm questioning whether this is adding gold plating. The SmallVector's are of pointers to values so it's just the SmallVector objects that are being copied, and this function will recurse at most 15 times. This will also need rewiring to use bool returns as the input and output vectors from this function are compared for equality to check if any pattern was found. It will be beneficial though so I'll have a go at implementing it.

1483

The latest insert is used (%y in this case) and the transform works ok, I'll add a test for this.

1500–1501

It can happen in cases such as <a, b, c, _, a, b, c, _> which I will add a test for

1503

Pattern.size() > 1 will be true but Pattern[0] can be null when the pattern <_, b, c, d> is evaluated. The previous revision hoisted this, but some tests were breaking as it was trying to create an InsertElement with a null 0th parameter . I think it makes sense to check it in the loop instead of hoisting it to make sure you don't do this.

MattDevereau marked an inline comment as done.
MattDevereau marked 2 inline comments as done.Jan 8 2023, 10:22 PM
MattDevereau added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1437

I've re-written this to now use a single vector passed by reference.

1438

I've fully replaced references to std::size_t with size_t now

1483

I've added the test dupq_f16_abcd_pattern_double_insert to sve-intrinsic-dupqlane.ll which asserts this behaviour.

1500–1501

I've added the test dupq_f16_abcnull_pattern to sve-intrinsic-dupqlane.ll to assert this.

sdesmalen added inline comments.Jan 9 2023, 4:24 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1437

Thanks, that looks much better!

1468–1469

Why is this optimisation limited to integer types?

1471–1472

It might be easier to write this as:

Value *CurrentInsertElt = nullptr, *Default = nullptr;
if (!match(II, m_Intrinsic<Intrinsic::vector_insert>(m_Value(Default), m_Value(CurrentInsertElt), m_Value()) ||
    !isa<FixedVectorType(CurrentInsertElt->getType())

as that matches the operands in one go.

1481

Does this need to be a decrementing loop? Or could you write:

while (InsertElementInst *InsertElt =
        dyn_cast<InsertElementInst>(CurrentInsertElt)) {
    auto Idx = cast<ConstantInt>(InsertElt->getOperand(2));
    InsertEltVec[Idx->getValue().getZExtValue()] = InsertElt->getOperand(1);
    CurrentInsertElt = InsertElt->getOperand(0);
}
1491

You'll also need to check the lanes of the original vector being inserted into:

%1 = insertelement <4 x float> poison, float %x, i64 0   ; <--- this poison value must be checked instead of the llvm.vector.insert one.
%2 = insertelement <4 x float> %1, float %y, i64 1
%3 = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v4f32(<vscale x 4 x float> poison, <4 x float> %2, i64 0)
%4 = tail call <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float> %3, i64 0)

However, you'll also need to check the default value of the llvm.vector.insert for this case:

%1 = insertelement <2 x float> poison, float %x, i64 0
%2 = insertelement <2 x float> %1, float %y, i64 1
%3 = tail call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v2f32(<vscale x 4 x float> poison, <2 x float> %2, i64 0) ; the full subvector is defined, but not all lanes in the subvector that's being dupq'ed.
%4 = tail call <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float> %3, i64 0)
MattDevereau marked 3 inline comments as done.
MattDevereau added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1468–1469

Oops, I ran into problems with integer types on the first pass I did on this patch but I can see the codegen is quite nice now with a single fmov:

fmov	s0, w0
mov	v0.h[1], w1
mov	v0.h[2], w2
mov	v0.h[3], w3
mov	z0.d, d0
ret

for example with <a, b, c d> as i16s. I'll remove this.

1471–1472

That looks a bit nicer, done

1481

It doesn't need to be a decrementing loop. This works well too so i've put it in

1491

We discussed outside the review that this would be ideally done in a follow-up patch.

Removed the logic to insert incomplete vectors into poison vectors.

Added a condition if (!isa<PoisonValue>(CurrentInsertElt) || !isa<PoisonValue>(Default) || to check all lanes of the subvector being inserted are defined, and the first inserted element is inserted into poison

I've left a bunch more comments, but overall this patch is moving in the right direction @MattDevereau!

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1443–1448

nit: please capitalise Lhs -> LHS and Rhs -> RHS (that's more common in LLVM for these variable names).

And feel free to ignore this suggestion, but I personally find this a little easier to read:

ArrayRef<Value*> Ref(Vec);                                       
auto LHS = Ref.take_front(HalfVecSize);
auto RHS = Ref.drop_front(HalfVecSize);
for (unsigned I=0; I<HalfVecSize; ++I) {
  if (LHS[I] != nullptr && RHS[I] != nullptr && LHS[I] == RHS[I])
    continue;

(i.e. using ArrayRef and indexing, rather than two changing pointers)

1451–1452

Can this condition be moved to the start of this function?

1470

nit: is it worth renaming this to Elts ? That's a bit simpler.

1471

nit: this can be auto, because the type is clear from the dyn_cast<InsertElementInst>.

1478

I don't think you'll need to check this, since you've already got the check for *Lhs != nullptr && *Rhs != nullptr in SimplifyValuePattern, which should always return false if any of the lanes is not explicitly defined. And because the size of InsertEltVec is based on the minimum number of elements of the scalable vector, it will also return false for the following case:

%0 = insertelement <2 x half> poison, half %a, i64 0
%1 = insertelement <2 x half> %0, half %b, i64 0
%2 = call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v2f16(<vscale x 8 x half> poison, <2 x half> %1, i64 0)

Because InsertEltVec will be defined as:

[%a, %b, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr]
1483

I still think that the code is not correct yet, because for e.g.

...
%7 = insertelement <8 x half> %6, half %c, i64 6
%8 = insertelement <8 x half> %7, half %c, i64 7
%9 = insertelement <8 x half> %8, half %d, i64 7
%10 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> %9, i64 0)

It starts at %9, so will first store %d to InsertEltVec[7].
Then it looks at %8, and it will store %c to InsertEltVec[7]

Which means the order is reversed.

The reason that your test still passes is that InstCombine has already removed the extra insertelement when it comes to this function.

Perhaps you can just return false if the element already exists in InsertEltVec?

1486

If you define InsertEltChain as PoisonValue::get(CurrentInsertElt->getType()), then you can start the loop at I = 0.

1489

I think you can do just Builder.getInt64(I) (and then you don't need the extra variable for it, and can propagate it directly into the expression below, and also drop the curly braces :)

1498

unsigned?

1499

unsigned?

1508

can just as well use auto or Value * for all these variables here, since their contents don't really matter too much.

llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
31

It's a pity this case isn't optimised. This is probably because instcombine has transformed it into a 'shufflevector' splat. You could do something like this:

if (Value *V = getSplatValue(CurrentInsertElt)) {              
  InsertEltVec[0] = V;                                         
  InsertEltVec.resize(1);                                      
} else {                                                       
  while (InsertElementInst *InsertElt =                        
             dyn_cast<InsertElementInst>(CurrentInsertElt)) {  
  ...
}

to handle it.

95

Is this test much different from @dupq_f16_ab_pattern_no_end_indices?

137–138

If you swap %c and %d this at least becomes a negative test to ensure LLVM doesn't perform the wrong optimisation. That's better than it being a positive test where it just happens to do the right thing because another InstCombine rule has fired first, even though your code would have optimised this incorrectly.

238

no end indices not poison? (perhaps there is a double negative here that confuses me?)

In any case, the end indices are poison, because the test is inserting into poison

261

Given the algorithm just compares two halves of the vector, I'm not sure there's much value in having both negative tests for "no_front_pattern", "no_middle_pattern" and "no_end_pattern"

MattDevereau added inline comments.Jan 10 2023, 7:51 AM
llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
31

I think the assembly output ends up as just a single MOV instruction and there isn't much to gain here (I'll double check this), the thinking behind this test was more just to prove the recursion can indeed get the smallest possible pattern a.

95

This test would splat <a, b, c, _> as a 64bit value whereas @dupq_f16_ab_pattern_no_end_indices would splat <a, b> as a 32bit value, which could check the algorithm can distinguish between the two, however I suppose that difference is ultimately not so important so I'll remove the indices tests.

238

Some of these tests that appear redundant are leftover from the previous revisions where allowing inserts of poison values changed things.
The first negative is the indices 6 & 7 missing from the insertelement chain.
This second negative is because the first parameter of the vector insert is a non-poison value`c`unlike the other tests which are all poison.

261

These tests don't mean much without the poison value insertions now, so i'll remove them

MattDevereau marked 15 inline comments as done.
MattDevereau added inline comments.Jan 11 2023, 12:00 PM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1443–1448

I think there's pros and cons to this approach, in your example the setup looks nicer but I think the ability to resize the vector without passing indices through function parameters looks cleaner and makes the result easier to process. I will stick with the current implementation for now I think just to keep the review more compact.

1451–1452

Yes it can be, done

1478

Fair enough, removed.

1489

This loop no longer looks quite as hideous, thanks :)

llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
31

This case currently gets emitted as a neon mov:

dup	v0.8h, v0.h[0]
mov	z0.q, q0
ret

Implementing your suggestion does change this to mov z0.h, h0 but also drastically changes all tests in sve-intrinsic-opts-cmpne.ll. I think this optimization is best left for another patch.

sdesmalen accepted this revision.Jan 13 2023, 6:50 AM

LGTM (please address my nit on the test before you land it)

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1443–1448

My suggestion was unrelated to resizing the vector, it was only meant for this loop. But I'm happy with the current version too.

llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll
31

Fair enough, I didn't expect that would happen.

137–138

Can you rename this test so that it's clear this is a negative test? (and also maybe add a comment?)