
[ARM][CodeGen] Add support for complex deinterleaving
Needs Review · Public

Authored by NickGuy on Nov 18 2021, 10:29 AM.

Details

Summary

Adds the Complex Deinterleaving Pass implementing support for complex numbers in a target-independent manner, deferring to the TargetLowering for the given target to create a target-specific intrinsic.

Diff Detail

Event Timeline


...
I'm not exactly sure why to do target specific pattern matching here. We could simply add generic complex intrinsics and the different patterns could be matched at each archs ISEL, no? I do agree that it should check in the backend if it should generate a complex operation of a given vector type.
...
Is there any strong reason to create target specific intrinsics instead of having generic intrinsics with polymorphism? Such as llvm.complex.add.v2f32?

I wanted to avoid adding target-independent intrinsics, as doing so might require substantial upstream consensus (Though this is being worked on in D119287). My thinking was that we can work on targeting our own architectures first, and then match incoming target-independent intrinsics after they are implemented. Additionally, IR intrinsics for Arm and AArch64 are already implemented, further reducing the amount of work required to enable initial support.

These difference can't be detected at isel?

Unless I'm missing something, there aren't any differences to be detected by isel. The distinction only applies when generating the relevant IR intrinsic, whereas isel is responsible for substituting the IR intrinsic with the instruction/s (in my understanding).

In a more global view, how do you plan to generate such input patterns detected here?

The patterns I've been focusing on are those emitted after loop vectorisation (which incidentally are also those emitted using std::complex with -ffast-math), because the MVE and Neon complex number instructions are vector instructions. Scalar patterns (like those in your linked snippet) are planned, but will require a bit more work to implement properly.

fhahn added a comment. Feb 9 2022, 9:00 AM

I think @jcranmer-intel is working on getting Clang to emit target-independent intrinsics for complex operations (see D119284 & linked patches). It might be good to sync up.

As Florian mentioned, I just re-uploaded a full stack of patches for complex intrinsics support, ranging from defining multiply and divide intrinsics, including an expansion for x86 architecture in both expansion to __mulsc3 and friends and full lowering to instructions, as well as building on top of them to finally get CX_LIMITED_RANGE support into clang. The most interesting patch is probably D119287, since that's the one that does all of the codegen work that this is largely doing, and I personally don't have sufficient expertise with ARM or AArch64 to design that code very well.

I haven't delved into the ARM-specific code in detail, but the ComplexArithmeticGraph feels like it's reinventing a lot of Instruction-like infrastructure just to avoid having to do anything with complex intrinsics. I may be biased here in thinking that it would be better to move to standardized complex intrinsics, and I understand why you don't want to go there, but there are two questions I have:

  • What operations would you need from standardized complex intrinsics to completely eliminate all of the logic in ComplexArithmeticGraph?
  • Have you tried an interface that lets targets nominate target-specific intrinsics for complex operations in lieu of creating a new graph ISA?
llvm/include/llvm/Transforms/Scalar/ComplexArithmetic.h
34

This says ComplexArithmetic, but it's mostly just limited to complex multiplication, right? There's no support for complex division or absolute value that I see (not that complex division is implemented in any hardware I'm aware of).

llvm/lib/Transforms/Scalar/ComplexArithmetic.cpp
447

In my experience with complex instructions, one of the big issues is that the varied frontends end up creating a variety of patterns for complex arithmetic, so you end up with a random choice of a complex number being represented in LLVM IR as a struct, a vector, or pairs of scalar numbers (especially as the ABIs for passing or returning them through functions is so insane). That's leaving aside the question of how the complex multiply itself is represented (is it expanded in IR, or a call to __mulsc3, or both?).

By starting the search for a complex multiply at a shufflevector, you're really leaving a lot of opportunities to match complex multiplies off the table. The pattern-matching I did in D119288 looks for insertvalue, insertelement, and matching stores for things that might be the result of complex multiplies or divisions.

(This may also be because you're aiming to do this after vectorization, and one of the problems that I'm trying to solve with complex intrinsics is that the patterns of complex instructions generated by the frontend aren't always kosher for vectorization).

...
I'm not exactly sure why to do target specific pattern matching here. We could simply add generic complex intrinsics and the different patterns could be matched at each archs ISEL, no? I do agree that it should check in the backend if it should generate a complex operation of a given vector type.
...
Is there any strong reason to create target specific intrinsics instead of having generic intrinsics with polymorphism? Such as llvm.complex.add.v2f32?

I wanted to avoid adding target-independent intrinsics, as doing so might require substantial upstream consensus (Though this is being worked on in D119287). My thinking was that we can work on targeting our own architectures first, and then match incoming target-independent intrinsics after they are implemented. Additionally, IR intrinsics for Arm and AArch64 are already implemented, further reducing the amount of work required to enable initial support.

My understanding was that this had very little to do with upstream consensus and more to do with generating optimal code. I probably didn't say this clearly enough but as far as I see it this pass shouldn't be matching "complex multiply". It should be matching "partial multiply with rotate".

AArch64 (under certain architecture options) has an instruction that is called fcmla. (Arm MVE has similar instructions, which this pass is targeting at the moment). It takes a vector and a "rotate" that can be 0, 90, 180 or 270. It performs something like (d0=d0+s0*t0, d1=d1+s0*t1) for the odd/even lanes with a rotate of #0. Or (d0=d0-s1*t1, d1=d1+s1*t0) with a rotate of #90. You can combine two fcmla with rotations of 0 and 90 to produce a single complex multiply, but they can also be combined in all kinds of other weird and wonderful ways. There is no single "complex multiply" instruction for AArch64/MVE, and if you limit the optimizer to just acting on complex multiply, you are always going to be leaving performance on the table from all the other patterns that could be selected.

These difference can't be detected at isel?

Unless I'm missing something, there aren't any differences to be detected by isel. The distinction only applies when generating the relevant IR intrinsic, whereas isel is responsible for substituting the IR intrinsic with the instruction/s (in my understanding).

This was written pre-ISel so that it could be shared between AArch64 and Arm MVE. It also makes looking at chunks of related instructions at the same time easier, in the same way as the MVE lane interleaving pass does (you look up to sources and down to sinks and whatnot).

In a more global view, how do you plan to generate such input patterns detected here?

The patterns I've been focusing on are those emitted after loop vectorisation (which incidentally are also those emitted using std::complex with -ffast-math), because the MVE and Neon complex number instructions are vector instructions. Scalar patterns (like those in your linked snippet) are planned, but will require a bit more work to implement properly.

I think @jcranmer-intel is working on getting Clang to emit target-independent intrinsics for complex operations (see D119284 & linked patches). It might be good to sync up.

As Florian mentioned, I just re-uploaded a full stack of patches for complex intrinsics support, ranging from defining multiply and divide intrinsics, including an expansion for x86 architecture in both expansion to __mulsc3 and friends and full lowering to instructions, as well as building on top of them to finally get CX_LIMITED_RANGE support into clang. The most interesting patch is probably D119287, since that's the one that does all of the codegen work that this is largely doing, and I personally don't have sufficient expertise with ARM or AArch64 to design that code very well.

Yeah thanks, we saw the updates. I had already left a message on https://reviews.llvm.org/D114398 a long time ago, but I think it was missed because the patch was a draft, and apparently that makes it fairly hidden in a Phabricator review.

I still have misgivings about generic complex intrinsics being the correct representation for the mid-end of LLVM. As far as I understand, at the moment they would just block fmul/fadd/etc. folds that might already be happening, and block any vectorization (which, from what I understand of the problem domain of complex numbers, is really important!)

As I said above, it feels difficult for me to see how they will produce the most optimal solution, given the instructions that are really available. And I don't yet have any examples of anything they really make better/easier to optimize.

Happy to hear your thoughts on that; my experience with complex numbers mainly comes from people using arrays of interleaved real/imaginary numbers, not _Complex or std::complex. From my understanding of what Nick had tried, they would all vectorize so that this pass (perhaps opportunistically) could match to something better for the target. From looking at the patches for the complex intrinsics, they look much more focussed on the clang representation and whether they will overflow.

Thanks all for the comments so far (And thanks Dave for taking on the evening shift, as it were)

I haven't delved into the ARM-specific code in detail, but the ComplexArithmeticGraph feels like it's reinventing a lot of Instruction-like infrastructure just to avoid having to do anything with complex intrinsics.

In hindsight, calling it a graph may be a misnomer; it acts more as a registry for attaching metadata to instructions, sitting alongside the Instruction use-def graph.

What operations would you need from standardized complex intrinsics to completely eliminate all of logic in ComplexArithmeticGraph?

Having operations that can map to the Arm/AArch64 instructions would be a requirement for us to solely depend on them. Beyond those defined in D119284: addition with rotations, and multiplying both complex components by a real/imaginary number.
That said, having the frontend emit numerous intrinsics just for Arm/AArch64 might set a bad precedent. I can imagine a case where different architectures have different ideas of what a complex multiply looks like, and each wants its own idea represented by the standard intrinsics. It might be a better idea to have the frontend emit intrinsics for the concept of a complex multiply, while targets can then match the specific patterns they're interested in beyond those.
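"Addition with rotations" can be written out explicitly; this is a sketch of the usual definition of a 90/270-degree rotated complex add (as in MVE's vcadd), not notation taken from the patch itself:

```latex
% Rotated complex addition: the second operand is rotated by 90 or 270
% degrees in the complex plane before the add.
\begin{align*}
a \oplus_{90} b  &= a + ib = (a_r - b_i) + (a_i + b_r)\,i \\
a \oplus_{270} b &= a - ib = (a_r + b_i) + (a_i - b_r)\,i
\end{align*}
```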

Have you tried an interface that lets targets nominate target-specific intrinsics for complex operations in lieu of creating a new graph ISA?

I might be missing something, but I'm not sure I follow how that would be different from delegating down to the TTI to generate the intrinsic. As I alluded to in a previous comment, the major problem with having common intrinsic building is that the parameters differ across architectures.

This says ComplexArithmetic, but it's mostly just limited to complex multiplication, right? There's no support for complex division or absolute value that I see (not that complex division is implemented in any hardware I'm aware of).

Multiplication and addition in its current state. It was designed through an Arm-tinted lens, so only supports the operations we have instructions for.

By starting the search for a complex multiply at a shufflevector, you're really leaving a lot of opportunities to match complex multiplies off the table. The pattern-matching I did in D119288 looks for insertvalue, insertelement, and matching stores for things that might be the result of complex multiplies or divisions.

I've seen insertelement emitted in the scalar versions, which is on my list. But we've focused on matching the vector forms first, as those are where we saw the best potential gains in our preliminary investigations.

...my experience with complex numbers mainly comes from people using arrays of interleaved real/imaginary numbers, not _Complex or std::complex. From my understanding of what Nick had tried, they would all vectorize so that this pass (perhaps opportunistically) could match to something better for the target...

That's correct, both cases (arrays of interleaved, and std::complex) are vectorised the same way, and produce the same IR. Though using std::complex does need -ffast-math to prevent generation of __mulsc3 and co.

NickGuy updated this revision to Diff 416915. Mar 21 2022, 6:54 AM
NickGuy edited the summary of this revision. (Show Details)

It's been longer than I would've liked before I've gotten back to this, but I have:

  • Added support for matching partial complex multiplication patterns, where a complex number is multiplied by either the real or imaginary component.
  • Moved the lowering process from TTI to TLI.
  • Minimised the amount of data passed to the TLI, to only expose what is needed (and not implementation details).

I had also experimented with the idea of taking a case where complex numbers are multiplied by real numbers, but the performance overhead of expanding the real number vector to match the expected pattern of our complex instructions (<a, _, b, _, c, _, ...>) caused any gains to be marginal at best, and substantial regressions were seen almost across the board.

I don't intend on adding any new patterns to this patch, further patterns (and AArch64 support) will be implemented in follow-up patches.

Herald added a project: Restricted Project. · View Herald Transcript · Mar 21 2022, 6:54 AM

OK - we can keep this going - we know it should give some decent performance. We can see where we go from there if something else materializes that can give the same gains.

At a high level - what this pass is doing either seems to be too complicated or too simple. What it effectively seems to do is pattern match shuffle(fadd/fmul/etc(shuffle, shuffle)) in a way that matches one of several patterns, and generates target intrinsics from that. If that is all that is needed then the "Graph" that gets created doesn't seem necessary. It only ever stores a single node, so isn't much of a graph. It is more complicated than it needs to be.

But I imagine in practice that a lot of math routines will do more than a single complex multiply. There will be multiple parts; this should be building a graph that is made up of several nodes, a bit at a time, and generating code from it only if it can create a full graph that looks profitable to generate.

There are a number of other comments below.

llvm/include/llvm/CodeGen/ComplexArithmetic.h
1 ↗(On Diff #416915)

ComplexArithmetic is a bit of a general name for what this pass is trying to achieve. What about something like ComplexDeinterleavingPass? It better explains that the pass is performing the deinterleaving from shuffles.

20 ↗(On Diff #416915)

Is this needed? It's strange to have a macro like this, and I'm not sure it's very useful in the places it is used.

59 ↗(On Diff #416915)

It doesn't seem that it would be necessary to combine ComplexArithmeticOperation operations together. It doesn't make much sense to have something that is both an Add _and_ a Partial Mul.

61 ↗(On Diff #416915)

Is this just needed to pass data between the pass and the TLI? It's probably simpler (and more like the existing methods) to just pass the parameters individually.

llvm/include/llvm/CodeGen/Passes.h
85 ↗(On Diff #416915)

I think this line can be removed. The other comments here are /// doxygen comments.

llvm/include/llvm/CodeGen/TargetLowering.h
2963 ↗(On Diff #416915)

This isn't needed if the target adds the pass itself. There will probably need to be a way to specify _which_ patterns the target supports for a given type. For example MVE has both integer and floating point complex operations. If the subtarget has only MVE (not MVE.fp), then it needs to support the integer complex operations, without supporting floating point. Other differences could exist like one architecture supporting a different subset of operations.

2969 ↗(On Diff #416915)

Is GeneratedIntrinsicCount just for statistics? If so it doesn't seem to be worth complicating the interface for.

llvm/lib/CodeGen/ComplexArithmetic.cpp
40 ↗(On Diff #416915)

This leaks memory. Same for the ones below. They can probably return a SmallVector<int>.

107 ↗(On Diff #416915)

Why are Real and Imaginary needed?

110 ↗(On Diff #416915)

Loads and Stores don't appear in the graph.

112 ↗(On Diff #416915)

Why is AddOperand needed? What does it mean?

113–114 ↗(On Diff #416915)

How is Preserve different to Input?

116 ↗(On Diff #416915)

What does this refer to?

119–120 ↗(On Diff #416915)

What does this mean and refer to?

321 ↗(On Diff #416915)

This looks like debugging code that should be removed. There is already an option to disable it.

660–661 ↗(On Diff #416915)

Remove commented out code.

703 ↗(On Diff #416915)

This shouldn't be something that is hard-coded in the pass. It might make more sense to push the splitting up of larger types to the target. There are cases like fixed-length SVE where the width won't be limited to 128bit vectors, and pushing that out to the target will be more general.

804 ↗(On Diff #416915)

This logic is all a bit strange. Why have Changed?

for (auto &I : *B) {
  if (auto *SVI = dyn_cast<ShuffleVectorInst>(&I))
    if (isInterleaving(SVI))
      if (traverseAndPopulateGraph(TLI, SVI, Graph))
        Substituted |= substituteGraph(TLI, &I, Graph, DeadInsts);
}
820 ↗(On Diff #416915)

I think this is just RecursivelyDeleteTriviallyDeadInstructions

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
15

These are probably unneeded?

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-add.ll
2 ↗(On Diff #416915)

I would use run lines that are similar to the llvm/test/CodeGen/Thumb2/mve-xyz.ll tests. Something like:

; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs %s -o - | FileCheck %s

It's best not to use a specific cpu, using the arch instead.

The tests can probably go in the same place too, under llvm/test/CodeGen/Thumb2/mve-xyz.ll for mve tests.

5 ↗(On Diff #416915)

This probably isn't worth testing if it is testing codegen already.

33 ↗(On Diff #416915)

A lot of these tests are strange - they seem to have infinite loops?

I think you should be able to remove most of it to make a much cleaner test. It doesn't even need the loads/stores:

define <8 x half> @complex_add_v8f16(<8 x half> %a, <8 x half> %b) {
entry:
  %strided.vec = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %strided.vec21 = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %strided.vec23 = shufflevector <8 x half> %b, <8 x half> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %strided.vec24 = shufflevector <8 x half> %b, <8 x half> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %0 = fsub fast <4 x half> %strided.vec23, %strided.vec21
  %1 = fadd fast <4 x half> %strided.vec24, %strided.vec
  %interleaved.vec = shufflevector <4 x half> %0, <4 x half> %1, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
  ret <8 x half> %interleaved.vec
}
NickGuy updated this revision to Diff 421520. Apr 8 2022, 7:24 AM
NickGuy marked 22 inline comments as done.
NickGuy edited the summary of this revision. (Show Details)
NickGuy added inline comments.
llvm/include/llvm/CodeGen/ComplexArithmetic.h
20 ↗(On Diff #416915)

It was more useful (though not necessarily needed) in an earlier iteration. I've removed it now.

59 ↗(On Diff #416915)

As above, useful in an earlier iteration. Also removed

llvm/lib/CodeGen/ComplexArithmetic.cpp
40 ↗(On Diff #416915)

Good catch, fixed.

119–120 ↗(On Diff #416915)

This entire enum is no longer required, so I've removed it all.

703 ↗(On Diff #416915)

I've pushed the 128 part to be provided by the target, but couldn't figure out a nice way to do the splitting entirely on the target. I'll be revisiting this when I have concrete examples of how scalable vectors will work.

804 ↗(On Diff #416915)

Holdover from an early iteration, where I tried to build the whole graph before changing anything. Simplified

820 ↗(On Diff #416915)

Looks to do the job, thanks

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-add.ll
33 ↗(On Diff #416915)

These were initially generated from a larger IR, and pushed through llvm-reduce. I've rewritten them manually to be much smaller (basing them off of your snippet).

mgabka added a subscriber: mgabka. Apr 14 2022, 3:25 AM
huntergr added inline comments. May 30 2022, 2:03 AM
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
482 ↗(On Diff #421520)

I can't find any place where rotations are dealt with besides assignment and equality checks of 0,90,180,270 -- might an enum be preferable? Or do you anticipate doing arithmetic with them in a follow-up patch?

690 ↗(On Diff #421520)

I think it would be worthwhile to add a comment about what the interleaving represents -- that you're looking for a shuffle that takes separate vectors of real and imaginary parts and combines them before they are stored to memory (or returned in registers), and that this is just for matching per-lane operations instead of cross-lane (like a reduction).

Or at least that's the behaviour I've observed when testing your patch with a loop like the following:

#define LEN (512)
float _Complex a[ LEN ];
float _Complex b[ LEN ];
float _Complex c[ LEN ];

void foo (void) {
#pragma clang loop vectorize(enable)
  for (int i = 0; i < LEN; ++i)
    a[i] = b[i] * c[i];
}
NickGuy updated this revision to Diff 438664. Tue, Jun 21, 5:55 AM
NickGuy marked 2 inline comments as done.

Since the last patch, I've redesigned the high-level approach; it now crawls through the instruction graph to find opportunities for complex deinterleaving before attempting to perform said deinterleaving. Doing it this way allows us to short-circuit, bailing out and preventing the heavy work from being performed if something isn't supported.
This iteration also implements support for complex operations with multiple steps/operands (e.g. a[i] * b[i] * c[i])

I haven't really looked too deeply into the meat of the pass yet - how it does the matching - but I had a chunk of comments for the rest.

What happened to the ideas of starting from a shuffle and working through a worklist of pairs of [Real, Imaginary] Values that we match to complex nodes? It would build up a graph of complex nodes to replace the original shuffle, providing that the leaves were all valid de-interleavings. At least on paper I feel we should just be able to perform the search without looking through a lot of uses (except for checking there are no extra uses, of course), and all the Unmatched instructions and what looks like matching of random unrelated snippets would be cleaned up.

llvm/lib/CodeGen/CMakeLists.txt
49 ↗(On Diff #438664)

Formatting is off here

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
1–2 ↗(On Diff #438664)

This looks like it got formatted incorrectly.

99 ↗(On Diff #438664)

item -> Item

100 ↗(On Diff #438664)

i -> I (or a better name, if I is already used)

149 ↗(On Diff #438664)

Reciprocal Throughput is more common.

511 ↗(On Diff #438664)

evaluateComplexDeinterleavingBasicBlock -> evaluateBasicBlock

991–992 ↗(On Diff #438664)

It is better to structure things in a way where we decide whether to do something, then do it. Not get half way through doing it and decide we didn't want to in the end. In what ways would we expect createComplexDeinterleavingIR to return nullptr at the moment?

1017 ↗(On Diff #438664)

Is this cost necessary at the moment, or will it always be profitable for the simple cases?

llvm/lib/Target/ARM/ARMISelLowering.cpp
21738–21742 ↗(On Diff #438664)

Things like this can be a single if:

if (!VTy || VTy->getNumElements() * VTy->getScalarSizeInBits() != 128)
    return false;

I presume the == 128 can be removed if we teach it how to split the vectors up?

21773 ↗(On Diff #438664)

Can this be an assert instead?

llvm/lib/Target/ARM/ARMTargetMachine.cpp
5–1

This doesn't need to add the new brackets.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-mul.ll
32 ↗(On Diff #421520)

Add arm_aapcs_vfpcc to any tests that take or return vectors.
I think you can remove #0 too.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f64-mul.ll
2 ↗(On Diff #438664)

It might be a little cleaner to add +fp64 for all these f64 tests.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-rotations-add.ll
1 ↗(On Diff #421520)

MVE tests can go into llvm/test/CodeGen/Thumb2/mve-complex-xyz.ll. So long as they are all updated by the test script, that should be fine to keep them with the rest.

NickGuy added a comment (edited). Wed, Jun 22, 4:02 AM

What happened to the ideas of starting from a shuffle and working through a worklist of pairs of [Real, Imaginary] Values that we match to complex nodes?

While that would work "simply" enough for cases like a * b, more elaborate cases (e.g. (a * b) * c) would result in some ambiguity as to which add(mul, mul) pairs with which sub(mul, mul). This complexity inflates further the more operands are involved. The approach I've implemented here considers each complex node in isolation (with the exception of the accumulator value).
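To make the ambiguity concrete, expanding (a * b) * c with p = a * b gives two structurally identical sub(mul, mul)/add(mul, mul) pairs (a sketch of the standard expansion, not notation from the patch):

```latex
\begin{align*}
p_r &= a_r b_r - a_i b_i, & p_i &= a_r b_i + a_i b_r,\\
\big((ab)\,c\big)_r &= p_r c_r - p_i c_i, & \big((ab)\,c\big)_i &= p_r c_i + p_i c_r.
\end{align*}
```

A matcher walking back from the final interleaving shuffle sees both fsub/fadd pairs at once and has to decide which fsub belongs with which fadd.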

I presume the == 128 can be removed if we teach it how to split the vectors up?

Yep, vector splitting is something I decided to push out of this initial patch, and will be implemented in a follow-up. (Due to each node being treated in isolation, the vector splitting from previous iterations got overly complicated and unwieldy). The ideal solution that I can see would be to teach the intrinsic how to split, rather than the pass (somewhere like DAGTypeLegalizer::SplitVectorResult).

Is this cost necessary at the moment, or will it always be profitable for the simple cases?

Complex add has a few cases where I've observed performance regressions, and that's the sort of thing this rudimentary cost-modelling is intended to catch.

Remaining comments will be addressed in a follow-up patch.

chill added a subscriber: chill. Tue, Jun 28, 3:28 AM
chill added inline comments.
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
136 ↗(On Diff #438664)

Here and elsewhere std::find -> llvm::find

593 ↗(On Diff #438664)

Why are we scanning the whole block?
It looks to me (admittedly I don't understand what this pass does yet)
we can just walk over the Instructions vector (in reverse, if the order matters),
avoiding the quadratic complexity.

633 ↗(On Diff #438664)

Here and in a few other places for (Value *V : Op0->operands()) {

chill added a comment. Tue, Jun 28, 3:30 AM

Do we have some broad overview of the approach and the algorithm? It'd be a good idea to put something like this in the description and eventually the commit message. (I searched for LLVM complex RFCs, but couldn't find anything useful.)

chill added inline comments. Tue, Jun 28, 7:42 AM
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
325 ↗(On Diff #438664)

return std::make_shared<ComplexDeinterleavingCompositeNode>(Operation);

561 ↗(On Diff #438664)

These could be, e.g.:

static const int RealMask[] = {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30};
auto RealMaskRef = ArrayRef<int>(RealMask, ShuffleMask.size());

with an assertion/bounds check. Good enough for 512-bit vectors with 16-bit elements, can be extended.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99 ↗(On Diff #438664)

Shouldn't these be translated to a couple of vcadd.f32 instructions, like in the previous test?
And this amount of move instructions seems excessive.

chill added inline comments. Tue, Jun 28, 7:46 AM
llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99 ↗(On Diff #438664)

And this amount of move instructions seems excessive.

I guess MVE does not have sensible swizzling instructions.

What happened to the ideas of starting from a shuffle and working through a worklist of pairs of [Real, Imaginary] Values that we match to complex nodes?

While that would work "simply" enough for cases like a * b, more elaborate cases (e.g. (a * b) * c) would result in some ambiguity as to which add(mul, mul) pairs with which sub(mul, mul). This complexity inflates further the more operands are involved. The approach I've implemented here considers each complex node in isolation (with the exception of the accumulator value).

Hmm, OK. I was thinking about a three-way multiply when looking at that way of structuring the pass, but only one pattern and only ever on paper. I was worried that there were cases where it was ambiguous, but figured if so it could always just try both possibilities. But I've not implemented it; it just sounded like an elegant way of treating this as a slightly more complex pattern matcher, as opposed to all this findUnmatchedInstructions and looking through uses.

A broad overview of the algorithm would be useful, like Momchil mentioned, perhaps in the file description. I was trying to make a sketch before but got a bit lost in nested loops.

I presume the == 128 can be removed if we teach it how to split the vectors up?

Yep, vector splitting is something I decided to push out of this initial patch, and will be implemented in a follow-up. (Due to each node being treated in isolation, the vector splitting from previous iterations got overly complicated and unwieldy). The ideal solution that I can see would be to teach the intrinsic how to split, rather than the pass (somewhere like DAGTypeLegalizer::SplitVectorResult).

A separate addition sounds good. Lets try and get something that we happy with and extend it out.

Is this cost necessary at the moment, or will it always be profitable for the simple cases?

Complex add has a few cases where I've observed performance regressions, and that's the sort of thing this rudimentary cost-modelling is intended to catch.

Do you have a test-case where this happens? Cost modelling will probably be good to add in the long run, I'd just like to understand where it currently can make things worse.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99 ↗(On Diff #438664)

Yeah this is expected from shuffles that MVE cannot handle very well. It would look a lot better either with loads that could become interleaving loads, or under AArch64 where better shuffles are available.

NickGuy updated this revision to Diff 442045. Mon, Jul 4, 2:55 AM
NickGuy marked 6 inline comments as done.
NickGuy retitled this revision from [ARM][CodeGen] Add support for complex addition and multiplication to [ARM][CodeGen] Add support for complex deinterleaving.

Addressed comments, and added a file description that attempts to explain what the pass does.

NickGuy marked 15 inline comments as done. Mon, Jul 4, 3:42 AM
NickGuy added inline comments.
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
593 ↗(On Diff #438664)

The order of the CompositeNodes is important, this was a holdover from before we were sorting the nodes.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99 ↗(On Diff #438664)

Shouldn't these be translated to a couple of vcadd.f32 instructions, like in the previous test?

Not in this case, no. The IR vectors are too wide to fit into actual vector registers, and the follow-up vector splitting patch (D129067) is restricted to splitting in half.