This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Enable processing of interconnected complex number operations
ClosedPublic

Authored by igor.kirillov on Mar 27 2023, 10:18 AM.

Details

Summary

With this patch, ComplexDeinterleavingPass now has the ability to handle
any number of interconnected operations involving complex numbers.
For example, the patch enables the processing of code like the following:

for (int i = 0; i < 1000; ++i) {

a[i] =  w[i] * v[i];
b[i] =  w[i] * u[i];

}

This code has multiple arrays containing complex numbers and a common
subexpression w that appears in two expressions.

Diff Detail

Event Timeline

igor.kirillov created this revision.Mar 27 2023, 10:18 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 27 2023, 10:18 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
igor.kirillov requested review of this revision.Mar 27 2023, 10:18 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 27 2023, 10:18 AM

Remove old comment

NickGuy added inline comments.Mar 28 2023, 3:29 AM
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
143

What's the rationale behind removing this? I might be missing something, but it looks like you're removing it then looking for the internal instructions on-demand in checkNodes later

905–910

How is this expected to behave with operands that leave the chain? (i.e. Phi nodes or instructions in a different basic block). Is there a risk of AllInstructions containing the most of the function in some cases?

I am planning to add several more patches after this one, including:

  1. Support for scalable vectors
  2. Reductions
  3. Full -Ofast mode

With regards -Ofast, the current code is suitable for simple cases where only one multiplication or addition is involved. However, when -Ofast flag is set, the compiler may rearrange the order of instructions, causing the real and imaginary parts to not run in parallel. For example, a loop like the following may not be processed:

for (int i = 0; i < N; ++i)
  u[i] = v[i] * w[i] + u[i] * y[i];
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
143

Ah, yes I should explain. I'll be adding support for -Ofast (see top comment), which might cause some issues. Basically, the problem is that the ComplexDeinterleavingNode can't be attached to ComplexDeinterleavingNode::Imag and ComplexDeinterleavingNode::Real instructions, which means that the submitCompositeNode function won't be able to fill InternalInstructions.

To solve this, I've decided to split the detection and checking process into two stages. I think it is more straightforward.

905–910

PHINodes will also act as as leaves and be in FinalInstructions, and instructions from different block are not problematic as long as we ensure that there are no other uses, which we do.

In case that a loop has a large number instructions, it will still be processed in the same manner and I don't see any problems.


As FCMLA/FCADD are vector instructions, our focus is mainly on vector loops. Therefore, I think it is more beneficial to perform this check on a BasicBlock level rather then a Function level. Alternatively, we could apply this pass per Loop, but this would result in losing support for some Neon cases that can be generated by the autovectorizer outside of a loop.

Matt added a subscriber: Matt.Apr 3 2023, 1:56 PM
mgabka added a subscriber: mgabka.Apr 4 2023, 1:58 AM
NickGuy accepted this revision.Apr 4 2023, 7:12 AM

LGTM

As FCMLA/FCADD are vector instructions, our focus is mainly on vector loops. Therefore, I think it is more beneficial to perform this check on a BasicBlock level rather then a Function level. Alternatively, we could apply this pass per Loop, but this would result in losing support for some Neon cases that can be generated by the autovectorizer outside of a loop.

That makes sense in this case, but keep in mind that scalar loops (or in the case of neon & complex doubles, even non-looping operations) can be lowered to use xcmla/xcadd, I'm not sure we should rely on the concept of a loop for this.
The scalar loops and co. are something that we can look into in the future though, trying to do so now will only complicate things and bloat the patch.

llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
2

Nit: Probably don't need +fullfp16 in this test, as half isn't being used.

This revision is now accepted and ready to land.Apr 4 2023, 7:12 AM
dmgreen added a subscriber: dmgreen.Apr 5 2023, 7:37 AM

Hello. Nick pointed me at the patch but I haven't looked at the details a huge amount. checkNodes feels quite complex, but handling multiple users seems like a nice addition to the pass. The pass should be able to handle multiple basic blocks just fine, but limiting multiple uses to all start in the same block sounds like a good compromise between complexity and functionality.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
892

Worklist is a commonly used name in LLVM

921

A user of an instruction will always be an instruction.

926

This seems like it adds the user OpI to OuterInstructions, just so that later is can look at the operands and add I to ToDo. Can it just add I directly?

llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
93

It is not obvious to me why d * c should prevent the rest of the tree transforming. Can you precommit the tests to show the differences in the review?

NickGuy added inline comments.Apr 5 2023, 8:14 AM
llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
93

I'm not sure it's the d * c that's preventing it, so much as the store to %p1 (It doesn't reinterleave, so the store is deemed "outside the graph").
Though I agree, pre-commits would be useful to see the difference in the other cases

igor.kirillov added inline comments.Apr 5 2023, 1:57 PM
llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
93

The idea is that any external use of complex computation chain prevents deinterleaving. And if one chain stops to be deinterleaved it might effect another chain, which is shown in this test. *p1 prevent for a*b to be deinterleaved, which in its turn prevents (a * b * c) to be deinterleaved and that in its turn spoils d*c expression. As result nothing is deinterleaved.

At least that's what I thought before pre-commiting this tests where pre-patch code decided to deinterleaved d * c operation - https://reviews.llvm.org/D147659. Now, I have a question. If shufflevector is used outside, should we still deinterleave code?

NickGuy requested changes to this revision.Apr 6 2023, 3:51 AM
NickGuy added inline comments.
llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
93

If shufflevector is used outside, should we still deinterleave code?

As long as the shufflevector used is the reinterleaving one we're fine to do so, as it should simply replace only what it can and preserve the rest, but that feels very much like a cost model problem to me.

The idea is that any external use of complex computation chain prevents deinterleaving. And if one chain stops to be deinterleaved it might effect another chain, which is shown in this test. *p1 prevent for a*b to be deinterleaved, which in its turn prevents (a * b * c) to be deinterleaved and that in its turn spoils d*c expression. As result nothing is deinterleaved.

In this case though, I don't see a reason why d * c shouldn't be replaced, it's not part of any other expression so should be able to be replaced in isolation. For that matter, the assignment to p1 is the only expression that I'd expect not to be transformed, the rest should be able to go through fine (it would be a different story if it was something like *p2 = (a * b) * p1, but the lack of data dependencies between them means they can and should be treated in isolation of eachother)

This revision now requires changes to proceed.Apr 6 2023, 3:51 AM
dmgreen added inline comments.Apr 6 2023, 4:57 AM
llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
93

Sorry I apparently forgot to send this as a reply to @igor.kirillov

Yeah that was my point, but it will depend on whether the shuffles get folded into a ld2 or need to be replicated. On it's own replacing fadd/fmul(shuffle, shuffle) with fcmla will be profitable even if the shuffles have other uses. But if the shuffles could be combined into a ld2 it isn't as obvious.

One of the earlier version of this pass has a costmodel that tried work out when it was profitable but I suggested removing it to simplify the patch, as it wasn't very useful at the time. I would guess that adding multiple disparate subgraphs into the same graph make that more difficult to do consistently, but in this case we can perhaps just check the shuffles operand to see if it is a load. It is probably worth having tests for both, and giving them better names.

igor.kirillov added inline comments.Apr 6 2023, 7:55 AM
llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
93

@dmgreen I am not sure I understood correctly which tests to add. But anyway I readjusted this one and added a new that shows how external use might infect operation graph - https://reviews.llvm.org/D147659

If final nodes have external users, it will not prevent complex deinterleaving. Refactoring

igor.kirillov marked 10 inline comments as done.Apr 11 2023, 6:26 AM
NickGuy added inline comments.Apr 11 2023, 8:57 AM
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
921

Nit: If we know for a fact that it's an Instruction, it might be better to use cast rather than dyn_cast

953

As above, cast instead of dyn_cast

llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
4–5

Why is it not? The presence of fcmla in the expected output tells me that it is expected to transform

Update comment and change dyn_cast to cast where it was required

igor.kirillov marked 3 inline comments as done.Apr 11 2023, 9:11 AM
igor.kirillov added inline comments.
llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
4–5

Yes! Forgot to update after rebasing on top of the pre-commit.

NickGuy accepted this revision.Apr 14 2023, 5:49 AM

Thanks for the ping, the recent changes passed me by completely.

This looks good to me, (with the caveat that cost modeling is missing at the moment). Though I might suggest waiting a couple more days before merging, in case someone notices anything else I missed :)

This revision is now accepted and ready to land.Apr 14 2023, 5:49 AM