
[ARM][CodeGen] Add support for complex addition and multiplication
Needs Review · Public

Authored by NickGuy on Nov 18 2021, 10:29 AM.

Details

Summary

Adds the Complex Deinterleaving Pass implementing support for complex numbers in a target-independent manner, deferring to the TargetLowering for the given target to create a target-specific intrinsic.

Diff Detail

Event Timeline

NickGuy created this revision.Nov 18 2021, 10:29 AM
NickGuy requested review of this revision.Nov 18 2021, 10:29 AM
fhahn added a subscriber: fhahn.Nov 18 2021, 10:39 AM

Nice one Nick.
Can you first run clang-format on this patch? It is pretty much unreadable with all these lint messages.

NickGuy updated this revision to Diff 388438.Nov 19 2021, 2:29 AM

Nice one Nick.
Can you first run clang-format on this patch? It is pretty much unreadable with all these lint messages.

Hmm, looks like my pre-commit hook isn't working anymore, which is disappointing. It should be formatted now though.
Thanks for spreading the word on this too.

Some context for everyone else, this is a prototype of a potential approach for implementing complex number support. Reviews about the general approach, rather than specific implementation details, would be preferred at this stage.

mdchen added a subscriber: mdchen.Nov 22 2021, 2:57 AM
telans added a subscriber: telans.Nov 23 2021, 1:20 AM
simoll added a subscriber: simoll.Nov 30 2021, 12:18 AM
Matt added a subscriber: Matt.Jan 12 2022, 2:31 PM
NickGuy updated this revision to Diff 406409.Feb 7 2022, 5:33 AM
NickGuy retitled this revision from [WIP][RFC][CodeGen] Add support for complex multiplication to [ARM][CodeGen] Add support for complex addition and multiplication.
NickGuy edited the summary of this revision. (Show Details)

Applied feedback received offline, redesigned the interface/implementation, and fixed up a number of fundamental issues in the previous iteration.

This sounds like it should lead to some nice speedup. I have some high level comments:

  • The supportsComplexArithmetic method doesn't add a lot on its own. It would probably be better to represent it in terms of asking the backend if it has certain patterns of certain types available. An f32 fcadd, for example, or an i16 fcmul_with_rotate (I'm not sure what to name them).
  • The ComplexArithmeticGraph seems like internal details to the pass. It would be best not to pass it over the TTI interface boundary if we can. The backend just needs to create the correct intrinsics of the correct type.
  • Equally I would keep all the matching logic in the pass, not through matchComplexArithmeticIR. That would be similar to, for example, the MachineCombiner pass which has its own enum of a set of patterns.
  • Both Arm MVE and AArch64 NEON (and Arm NEON) have fcmla instructions with rotate, two of which combined can perform a complex multiply. If we represent this in terms of the individual instructions that we can use, there is more potential for optimization. You can imagine other cases like adding 10 to the real part of a vector, which can be transformed into an fadd of [10, 0, 10, 0]. It may be possible to do some of the conjugates with a vrev and a vneg. That kind of thing is (I imagine) likely to come up in larger realistic examples or real math equations. Some of those can hopefully be added later, so long as we think it will be possible to expand. Currently we just seem to be pattern matching individual patterns? With the intention to expand this into a graph?
  • The InterleavedAccessPass (which this is very similar to) uses TLI, not TTI. It may be better to do the same thing here. Similarly is there a reason this is in Transforms and not CodeGen?
  • The tests look good, but they seem to have a funny infinite loop in them. I don't think they need the loop (or the loads/stores, potentially, although they likely don't hurt anything).
  • All the tests are fast, but we will need to make sure this conforms to fp arithmetic correctly without that too.
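To make the first bullet's fcadd concrete, here is a scalar model of a complex add with rotation (a sketch only; the function name and lane convention are mine, with lanes interleaved as [re0, im0, re1, im1, ...], matching the layout the pass deinterleaves):

```cpp
#include <cassert>
#include <cstddef>

// Scalar model (not LLVM code) of a complex add with rotation over
// interleaved lanes. Per complex pair, rotate #90 computes d = s + i*t
// and rotate #270 computes d = s - i*t.
void fcaddModel(float *d, const float *s, const float *t, std::size_t pairs,
                int rot) {
  for (std::size_t k = 0; k < pairs; ++k) {
    float sr = s[2 * k], si = s[2 * k + 1];
    float tr = t[2 * k], ti = t[2 * k + 1];
    if (rot == 90) {
      d[2 * k] = sr - ti;      // real: s.re - t.im
      d[2 * k + 1] = si + tr;  // imag: s.im + t.re
    } else {                   // rot == 270
      d[2 * k] = sr + ti;      // real: s.re + t.im
      d[2 * k + 1] = si - tr;  // imag: s.im - t.re
    }
  }
}
```

For example, with s = 1+2i and t = 3+4i, rotate #90 yields -3+5i (that is, s + i*t), which is exactly the fsub/fadd pair on deinterleaved lanes that the tests in this patch match.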
NickGuy added a comment.EditedFeb 8 2022, 2:23 AM

The supportsComplexArithmetic method doesn't add a lot on its own. It would probably be better to represent it in terms of asking the backend if it has certain patterns of certain types available. An f32 fcadd, for example, or an i16 fcmul_with_rotate (I'm not sure what to name them).
The ComplexArithmeticGraph seems like internal details to the pass. It would be best not to pass it over the TTI interface boundary if we can. The backend just needs to create the correct intrinsics of the correct type.
Equally I would keep all the matching logic in the pass, not through matchComplexArithmeticIR. That would be similar to, for example, the MachineCombiner pass which has its own enum of a set of patterns.

Unfortunately, not all architectures/intrinsics follow the same patterns, so this was done to provide a way for architecture-specific stuff to be handled as needed. The MVE halving complex add (which I only now realise was included in this patch) is a good example of this; it only works on integers, and requires some extra instructions to be matched. This instruction is only in MVE; it is not part of NEON in any way.
The intrinsics that are common are also constructed differently; while the NEON intrinsics are separated out by their rotations (aarch64_neon_vcmla_rot0, aarch64_neon_vcmla_rot90), MVE squishes them into one and determines the rotation based on a parameter.
ComplexArithmeticGraph is used to encapsulate all the data the TTI might need in order to correctly generate the right intrinsic. It could be replaced by something a bit lighter (and that doesn't expose the implementation detail of the node list), but something will be needed to provide this info.

Both Arm MVE and AArch64 NEON (and Arm NEON) have fcmla instructions with rotate, two of which combined can perform a complex multiply. If we represent this in terms of the individual instructions that we can use, there is more potential for optimization. You can imagine other cases like adding 10 to the real part of a vector, which can be transformed into an fadd of [10, 0, 10, 0]. It may be possible to do some of the conjugates with a vrev and a vneg. That kind of thing is (I imagine) likely to come up in larger realistic examples or real math equations. Some of those can hopefully be added later, so long as we think it will be possible to expand. Currently we just seem to be pattern matching individual patterns? With the intention to expand this into a graph?

Agreed, it should be possible to add them later. Ideally it reaches a point where we can remove the idea of a complex multiply from this, and replace it with the specific operations.

The InterleavedAccessPass (which this is very similar to) uses TLI, not TTI. It may be better to do the same thing here.

I can switch it to use TLI instead, if that's better suited for this job.

Similarly is there a reason this is in Transforms and not CodeGen?

That was down to some build/link issues I was facing when getting the pass registered with the new pass manager. I don't have the specific error to hand, but it was due to unresolved references. There's no strict reason why it's not in CodeGen.

The tests look good, but they seem to have a funny infinite loop in them. I don't think they need the loop (or the loads/stores, potentially, although they likely don't hurt anything).

The tests were taken from a cpp snippet, and put through llvm-reduce (testing that complex instructions were being generated, the null-tests had their types changed by hand afterwards), so the IR might have some odd control flow.

All the tests are fast, but we will need to make sure this conforms to fp arithmetic correctly without that too.

+1, will add more conformance tests.

Hi, thanks for working on this. I do like the direction this patch goes, by detecting complex operations at IR level. I do have some question though.

The supportsComplexArithmetic method doesn't add a lot on its own. It would probably be better to represent it in terms of asking the backend if it has certain patterns of certain types available. An f32 fcadd, for example, or an i16 fcmul_with_rotate (I'm not sure what to name them).
The ComplexArithmeticGraph seems like internal details to the pass. It would be best not to pass it over the TTI interface boundary if we can. The backend just needs to create the correct intrinsics of the correct type.
Equally I would keep all the matching logic in the pass, not through matchComplexArithmeticIR. That would be similar to, for example, the MachineCombiner pass which has its own enum of a set of patterns.

Unfortunately, not all architectures/intrinsics follow the same patterns, so this was done to provide a way for architecture-specific stuff to be handled as needed. The MVE halving complex add (which I only now realise was included in this patch) is a good example of this; it only works on integers, and requires some extra instructions to be matched. This instruction is only in MVE; it is not part of NEON in any way.

I'm not exactly sure why to do target-specific pattern matching here. We could simply add generic complex intrinsics, and the different patterns could be matched at each arch's ISel, no? I do agree that it should check in the backend if it should generate a complex operation of a given vector type.

The intrinsics that are common are also constructed differently; while the NEON intrinsics are separated out by their rotations (aarch64_neon_vcmla_rot0, aarch64_neon_vcmla_rot90), MVE squishes them into one and determines the rotation based on a parameter.
ComplexArithmeticGraph is used to encapsulate all the data the TTI might need in order to correctly generate the right intrinsic. It could be replaced by something a bit lighter (and that doesn't expose the implementation detail of the node list), but something will be needed to provide this info.

These differences can't be detected at isel?

In a more global view, how do you plan to generate such input patterns detected here? Using C/C++ complex usually generates a structure with real/imaginary values (https://godbolt.org/z/4snrjj4oG), and those can't be used as a base value for the vector extension.
Will there be a frontend patch linked to this, or you expect to use target specific builtins for it? If target specific builtins, those could generate the target intrinsics directly, no?

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
338–341 ↗(On Diff #406409)

Is there any strong reason to create target specific intrinsics instead of having generic intrinsics with polymorphism? Such as llvm.complex.add.v2f32?

llvm/include/llvm/Transforms/Scalar/ComplexArithmetic.h
107 ↗(On Diff #406409)

nit. Are you expecting to have values in the NodeTypes enum bigger than Discover? If not, you don't need the AND operation ("& Discover") if you define the underlying type of the enum as unsigned or unsigned short.

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
2365 ↗(On Diff #406409)

nit. Add new line?

2457 ↗(On Diff #406409)

nit. extra space.

...
I'm not exactly sure why to do target-specific pattern matching here. We could simply add generic complex intrinsics, and the different patterns could be matched at each arch's ISel, no? I do agree that it should check in the backend if it should generate a complex operation of a given vector type.
...
Is there any strong reason to create target specific intrinsics instead of having generic intrinsics with polymorphism? Such as llvm.complex.add.v2f32?

I wanted to avoid adding target-independent intrinsics, as doing so might require substantial upstream consensus (Though this is being worked on in D119287). My thinking was that we can work on targeting our own architectures first, and then match incoming target-independent intrinsics after they are implemented. Additionally, IR intrinsics for Arm and AArch64 are already implemented, further reducing the amount of work required to enable initial support.

These differences can't be detected at isel?

Unless I'm missing something, there aren't any differences to be detected by isel. The distinction only applies when generating the relevant IR intrinsic, whereas isel is responsible for substituting the IR intrinsic with the instruction/s (in my understanding).

In a more global view, how do you plan to generate such input patterns detected here?

The patterns I've been focusing on are those emitted after loop vectorisation (which incidentally are also those emitted using std::complex with -ffast-math), because the MVE and Neon complex number instructions are vector instructions. Scalar patterns (like those in your linked snippet) are planned, but will require a bit more work to implement properly.

fhahn added a comment.Feb 9 2022, 9:00 AM

I think @jcranmer-intel is working on getting Clang to emit target-independent intrinsics for complex operations (see D119284 & linked patches). It might be good to sync up.

As Florian mentioned, I just re-uploaded a full stack of patches for complex intrinsics support, ranging from defining multiply and divide intrinsics, including an expansion for the x86 architecture (both expansion to __mulsc3 and friends and full lowering to instructions), to building on top of them to finally get CX_LIMITED_RANGE support into clang. The most interesting patch is probably D119287, since that's the one that does all of the codegen work that this is largely doing, and I personally don't have sufficient expertise with ARM or AArch64 to design that code very well.

I haven't delved into the ARM-specific code in detail, but the ComplexArithmeticGraph feels like it's reinventing a lot of Instruction-like infrastructure just to avoid having to do anything with complex intrinsics. I may be biased here in thinking that it would be better to move to standardized complex intrinsics, and I understand why you don't want to go there, but there are two questions I have:

  • What operations would you need from standardized complex intrinsics to completely eliminate all of the logic in ComplexArithmeticGraph?
  • Have you tried an interface that lets targets nominate target-specific intrinsics for complex operations in lieu of creating a new graph ISA?
llvm/include/llvm/Transforms/Scalar/ComplexArithmetic.h
34 ↗(On Diff #406409)

This says ComplexArithmetic, but it's mostly just limited to complex multiplication, right? There's no support for complex division or absolute value that I see (not that complex division is implemented in any hardware I'm aware of).

llvm/lib/Transforms/Scalar/ComplexArithmetic.cpp
447 ↗(On Diff #406409)

In my experience with complex instructions, one of the big issues is that the varied frontends end up creating a variety of patterns for complex arithmetic, so you end up with a random choice of a complex number being represented in LLVM IR as a struct, a vector, or pairs of scalar numbers (especially as the ABIs for passing or returning them through functions is so insane). That's leaving aside the question of how the complex multiply itself is represented (is it expanded in IR, or a call to __mulsc3, or both?).

By starting the search for a complex multiply at a shufflevector, you're really leaving a lot of opportunities to match complex multiplies off the table. The pattern-matching I did in D119288 looks for insertvalue, insertelement, and matching stores for things that might be the result of complex multiplies or divisions.

(This may also be because you're aiming to do this after vectorization, and one of the problems that I'm trying to solve with complex intrinsics is that the patterns of complex instructions generated by the frontend aren't always kosher for vectorization).

...
I'm not exactly sure why to do target specific pattern matching here. We could simply add generic complex intrinsics and the different patterns could be matched at each archs ISEL, no? I do agree that it should check in the backend if it should generate a complex operation of a given vector type.
...
Is there any strong reason to create target specific intrinsics instead of having generic intrinsics with polymorphism? Such as llvm.complex.add.v2f32?

I wanted to avoid adding target-independent intrinsics, as doing so might require substantial upstream consensus (Though this is being worked on in D119287). My thinking was that we can work on targeting our own architectures first, and then match incoming target-independent intrinsics after they are implemented. Additionally, IR intrinsics for Arm and AArch64 are already implemented, further reducing the amount of work required to enable initial support.

My understanding was that this had very little to do with upstream consensus and more to do with generating optimal code. I probably didn't say this clearly enough but as far as I see it this pass shouldn't be matching "complex multiply". It should be matching "partial multiply with rotate".

AArch64 (under certain architecture options) has an instruction that is called fcmla. (Arm MVE has similar instructions, which this pass is targeting at the moment). It takes a vector and a "rotate" that can be 0, 90, 180 or 270. It performs something like (d0=d0+s0*t0, d1=d1+s0*t1) for the odd/even lanes with a rotate of #0. Or (d0=d0-s1*t1, d1=d1+s1*t0) with a rotate of #90. You can combine two fcmla with rotations of 0 and 90 to produce a single complex multiply, but they can also be combined in all kinds of other weird and wonderful ways. There is no single "complex multiply" instruction for AArch64/MVE, and if you limit the optimizer to just acting on complex multiply, you are always going to be leaving performance on the table from all the other patterns that could be selected.
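The two-fcmla decomposition described above can be sketched in scalar code as follows (my own model, not LLVM code; names are invented, lanes interleaved as [re, im, re, im, ...]):

```cpp
#include <cassert>
#include <cstddef>

// Scalar model of fcmla's per-pair behaviour on interleaved lanes.
// Rotate #0:  d_even += s_even * t_even;  d_odd += s_even * t_odd
// Rotate #90: d_even -= s_odd  * t_odd;   d_odd += s_odd  * t_even
void fcmlaModel(float *d, const float *s, const float *t, std::size_t pairs,
                int rot) {
  for (std::size_t k = 0; k < pairs; ++k) {
    float s0 = s[2 * k], s1 = s[2 * k + 1];
    float t0 = t[2 * k], t1 = t[2 * k + 1];
    if (rot == 0) {
      d[2 * k] += s0 * t0;
      d[2 * k + 1] += s0 * t1;
    } else if (rot == 90) {
      d[2 * k] -= s1 * t1;
      d[2 * k + 1] += s1 * t0;
    }
  }
}
```

Chaining rotate #0 then rotate #90 into a zeroed accumulator gives the full complex multiply: for s = 1+2i and t = 3+4i the accumulator ends at -5+10i, i.e. (ac - bd) + (ad + bc)i. Either fcmla on its own is also exactly the "partial multiply with rotate" pattern mentioned above, which is why matching the individual instructions exposes more opportunities than matching only a whole complex multiply.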

These differences can't be detected at isel?

Unless I'm missing something, there aren't any differences to be detected by isel. The distinction only applies when generating the relevant IR intrinsic, whereas isel is responsible for substituting the IR intrinsic with the instruction/s (in my understanding).

This was written pre-ISel so that it could be shared between AArch64 and Arm MVE. It also makes looking at chunks of related instructions at the same time easier, in the same way as the MVE lane interleaving pass does (you look up to sources and down to sinks and whatnot).

In a more global view, how do you plan to generate such input patterns detected here?

The patterns I've been focusing on are those emitted after loop vectorisation (which incidentally are also those emitted using std::complex with -ffast-math), because the MVE and Neon complex number instructions are vector instructions. Scalar patterns (like those in your linked snippet) are planned, but will require a bit more work to implement properly.

I think @jcranmer-intel is working on getting Clang to emit target-independent intrinsics for complex operations (see D119284 & linked patches). It might be good to sync up.

As Florian mentioned, I just re-uploaded a full stack of patches for complex intrinsics support, ranging from defining multiply and divide intrinsics, including an expansion for the x86 architecture (both expansion to __mulsc3 and friends and full lowering to instructions), to building on top of them to finally get CX_LIMITED_RANGE support into clang. The most interesting patch is probably D119287, since that's the one that does all of the codegen work that this is largely doing, and I personally don't have sufficient expertise with ARM or AArch64 to design that code very well.

Yeah thanks, we saw the updates. I had already left a message on https://reviews.llvm.org/D114398 a long time ago, but I think it was missed because the patch was a draft, and apparently that makes it fairly hidden in a Phabricator review.

I still have misgivings about generic complex intrinsics being the correct representation for the mid-end of LLVM. As far as I understand, at the moment they would just block fmul/fadd/etc folds that might already be happening, and block any vectorization (which, from what I understand of the problem domain of complex numbers, is really important!)

As I said above, it feels difficult for me to see how they will produce the most optimal solution, given the instructions that are really available. And I don't yet have any examples of anything they really make better/easier to optimize.

Happy to hear your thoughts on that; my experience with complex numbers mainly comes from people using arrays of interleaved real/imaginary numbers, not _Complex or std::complex. From my understanding of what Nick had tried, they would all vectorize so that this pass (perhaps opportunistically) could match to something better for the target. From looking at the patches of the complex intrinsics, they look much more focused on the clang representation and whether they will overflow.

Thanks all for the comments so far (And thanks Dave for taking on the evening shift, as it were)

I haven't delved into the ARM-specific code in detail, but the ComplexArithmeticGraph feels like it's reinventing a lot of Instruction-like infrastructure just to avoid having to do anything with complex intrinsics.

In hindsight, calling it a graph may be a misnomer; it acts more as a registry for attaching metadata to instructions, sitting alongside the Instruction use-def graph.

What operations would you need from standardized complex intrinsics to completely eliminate all of the logic in ComplexArithmeticGraph?

Having operations that can map to the Arm/AArch64 instructions would be a requirement for us to solely depend on them. Beyond those defined in D119284; Addition with rotations, and multiplying both complex components by a real/imaginary number.
That said, having the frontend emit numerous intrinsics just for Arm/AArch64 might set a bad precedent. I can imagine a case where different architectures have different ideas of what a complex multiply looks like, and each wants its own idea represented by the standard intrinsics. It might be a better idea to have the frontend emit intrinsics for the concept of a complex multiply, while targets can then match the specific patterns they're interested in beyond those.

Have you tried an interface that lets targets nominate target-specific intrinsics for complex operations in lieu of creating a new graph ISA?

I might be missing something, but I'm not sure I follow how that would be different from delegating down to the TTI to generate the intrinsic. As I alluded to in a previous comment, the major problem with having common intrinsic building is that parameters are different across architectures.

This says ComplexArithmetic, but it's mostly just limited to complex multiplication, right? There's no support for complex division or absolute value that I see (not that complex division is implemented in any hardware I'm aware of).

Multiplication and addition in its current state. It was designed through an Arm-tinted lens, so only supports the operations we have instructions for.

By starting the search for a complex multiply at a shufflevector, you're really leaving a lot of opportunities to match complex multiplies off the table. The pattern-matching I did in D119288 looks for insertvalue, insertelement, and matching stores for things that might be the result of complex multiplies or divisions.

I've seen insertelement emitted in the scalar versions, which is on my list. But we've focused on matching the vector forms first, as those are where we saw the best potential gains in our preliminary investigations.

...my experience with complex numbers mainly comes from people using arrays of interleaved real/imaginary numbers, not _Complex or std::complex. From my understanding of what Nick had tried, they would all vecotorize so that this pass (perhaps opportunistically) could match to something better for the target...

That's correct, both cases (arrays of interleaved, and std::complex) are vectorised the same way, and produce the same IR. Though using std::complex does need -ffast-math to prevent generation of __mulsc3 and co.
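A rough illustration of that equivalence (a sketch, not the pass's code): the interleaved-array formulation computes exactly the same values as std::complex, which is why both vectorise to the same deinterleave/compute/interleave IR.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Complex multiply over an interleaved real/imaginary array, the layout
// this pass matches after loop vectorisation.
void mulInterleaved(const float *a, const float *b, float *out,
                    std::size_t pairs) {
  for (std::size_t k = 0; k < pairs; ++k) {
    float ar = a[2 * k], ai = a[2 * k + 1];
    float br = b[2 * k], bi = b[2 * k + 1];
    out[2 * k] = ar * br - ai * bi;      // real part
    out[2 * k + 1] = ar * bi + ai * br;  // imaginary part
  }
}
```

For finite inputs this matches std::complex<float> multiplication element for element; the difference -ffast-math makes is only that the compiler may then skip the __mulsc3 infinity/NaN fix-up path and emit this naive formula directly, which is the vectorisable form.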

NickGuy updated this revision to Diff 416915.Mar 21 2022, 6:54 AM
NickGuy edited the summary of this revision. (Show Details)

It's been longer than I would've liked before getting back to this, but I have:

  • Added support for matching partial complex multiplication patterns, where a complex number is multiplied by either the real or imaginary component.
  • Moved the lowering process from TTI to TLI.
  • Minimised the amount of data passed to the TLI, to only expose what is needed (and not implementation details).

I had also experimented with the idea of taking a case where complex numbers are multiplied by real numbers, but the
performance overhead of expanding the real number vector to match the expected pattern of our complex instructions (<a, _, b, _, c, _, ...>)
caused any gains to be marginal at best, and substantial regressions were seen almost across the board.
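For illustration, the expansion being described could be sketched like this (function name invented; the "_" don't-care lanes are modelled as zero):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Spread a vector of real scalars into the even lanes of a
// complex-interleaved layout: <a, _, b, _, c, _, ...>. In IR this means an
// extra shuffle per real operand, which is the overhead that erased the
// gains from using the complex instructions in this case.
std::vector<float> expandRealToComplexLanes(const std::vector<float> &reals) {
  std::vector<float> out(reals.size() * 2, 0.0f);  // '_' lanes as 0
  for (std::size_t i = 0; i < reals.size(); ++i)
    out[2 * i] = reals[i];
  return out;
}
```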

I don't intend on adding any new patterns to this patch, further patterns (and AArch64 support) will be implemented in follow-up patches.

Herald added a project: Restricted Project.Mar 21 2022, 6:54 AM

OK - we can keep this going - we know it should give some decent performance. We can see where we go from there if something else materializes that can give the same gains.

At a high level - what this pass is doing either seems to be too complicated or too simple. What it effectively seems to do is pattern match shuffle(fadd/fmul/etc(shuffle, shuffle)) in a way that matches one of several patterns, and generates target intrinsics from that. If that is all that is needed then the "Graph" that gets created doesn't seem necessary. It only ever stores a single node, so it isn't much of a graph. It is more complicated than it needs to be.

But I imagine in practice that a lot of math routines will do more than a single complex multiply. There will be multiple parts; this should be building a graph made up of several nodes a bit at a time, and generating code from it only if it can create a full graph that looks profitable to generate.

There are a number of other comments below.

llvm/include/llvm/CodeGen/ComplexArithmetic.h
1 ↗(On Diff #416915)

ComplexArithmetic is a bit of a general name for what this pass is trying to achieve. What about something like ComplexDeinterleavingPass? It better explains that the pass is performing the deinterleaving from shuffles.

20 ↗(On Diff #416915)

Is this needed? It's strange to have a macro like this, and I'm not sure it's very useful in the places it is used.

59 ↗(On Diff #416915)

It doesn't seem that it would be necessary to combine ComplexArithmeticOperation operations together. It doesn't make much sense to have something that is both an Add _and_ a Partial Mul.

61 ↗(On Diff #416915)

Is this just needed to pass data between the pass and the TLI? It's probably simpler (and more like the existing methods) to just pass the parameters individually.

llvm/include/llvm/CodeGen/Passes.h
85

I think this line can be removed. The other comments here are /// doxygen comments.

llvm/include/llvm/CodeGen/TargetLowering.h
2974

This isn't needed if the target adds the pass itself. There will probably need to be a way to specify _which_ patterns the target supports for a given type. For example MVE has both integer and floating point complex operations. If the subtarget has only MVE (not MVE.fp), then it needs to support the integer complex operations, without supporting floating point. Other differences could exist like one architecture supporting a different subset of operations.

2980

Is GeneratedIntrinsicCount just for statistics? If so it doesn't seem to be worth complicating the interface for.

llvm/lib/CodeGen/ComplexArithmetic.cpp
40 ↗(On Diff #416915)

This leaks memory. Same for the ones below. They can probably return a SmallVector<int>.
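The suggested fix could look roughly like this (a sketch with an invented function name, using std::vector in place of llvm::SmallVector<int>): returning the mask container by value instead of a pointer to new[]'d storage removes the leak.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build the shuffle mask selecting the even (real) lanes of an interleaved
// vector. Returned by value, so ownership is unambiguous and nothing leaks.
std::vector<int> evenLaneMask(unsigned numComplexElts) {
  std::vector<int> mask;
  mask.reserve(numComplexElts);
  for (unsigned i = 0; i < numComplexElts; ++i)
    mask.push_back(static_cast<int>(2 * i));  // lanes 0, 2, 4, ...
  return mask;
}
```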

107 ↗(On Diff #416915)

Why are Real and Imaginary needed?

110 ↗(On Diff #416915)

Loads and Stores don't appear in the graph.

112 ↗(On Diff #416915)

Why is AddOperand needed? What does it mean?

113–114 ↗(On Diff #416915)

How is Preserve different to Input?

116 ↗(On Diff #416915)

What does this refer to?

119–120 ↗(On Diff #416915)

What does this mean and refer to?

321 ↗(On Diff #416915)

This looks like debugging code that should be removed. There is already an option to disable it.

660–661 ↗(On Diff #416915)

Remove commented out code.

703 ↗(On Diff #416915)

This shouldn't be something that is hard-coded in the pass. It might make more sense to push the splitting up of larger types to the target. There are cases like fixed-length SVE where the width won't be limited to 128bit vectors, and pushing that out to the target will be more general.

804 ↗(On Diff #416915)

This logic is all a bit strange. Why have Changed?

for (auto &I : *B) {
  if (auto *SVI = dyn_cast<ShuffleVectorInst>(&I)) {
    if (isInterleaving(SVI))
      if (traverseAndPopulateGraph(TLI, SVI, Graph))
        Substituted |= substituteGraph(TLI, &I, Graph, DeadInsts);
  }
}
820 ↗(On Diff #416915)

I think this is just RecursivelyDeleteTriviallyDeadInstructions

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
15 ↗(On Diff #416915)

These are probably unneeded?

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-add.ll
2

I would use run lines that are similar to the llvm/test/CodeGen/Thumb2/mve-xyz.ll tests. Something like:

; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs %s -o - | FileCheck %s

It's best not to use a specific CPU; use the arch instead.

The tests can probably go in the same place too, under llvm/test/CodeGen/Thumb2/mve-xyz.ll for mve tests.

5

This probably isn't worth testing if it is testing codegen already.

33

A lot of these tests are strange - they seem to have infinite loops?

I think you should be able to remove most of it to make a much cleaner test. It doesn't even need the loads/stores:

define <8 x half> @complex_add_v8f16(<8 x half> %a, <8 x half> %b) {
entry:
  %strided.vec = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %strided.vec21 = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %strided.vec23 = shufflevector <8 x half> %b, <8 x half> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %strided.vec24 = shufflevector <8 x half> %b, <8 x half> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %0 = fsub fast <4 x half> %strided.vec23, %strided.vec21
  %1 = fadd fast <4 x half> %strided.vec24, %strided.vec
  %interleaved.vec = shufflevector <4 x half> %0, <4 x half> %1, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
  ret <8 x half> %interleaved.vec
}
NickGuy updated this revision to Diff 421520.Apr 8 2022, 7:24 AM
NickGuy marked 22 inline comments as done.
NickGuy edited the summary of this revision. (Show Details)
NickGuy added inline comments.
llvm/include/llvm/CodeGen/ComplexArithmetic.h
20 ↗(On Diff #416915)

It was more useful (though not necessarily needed) in an earlier iteration. I've removed it now.

59 ↗(On Diff #416915)

As above, useful in an earlier iteration. Also removed

llvm/lib/CodeGen/ComplexArithmetic.cpp
40 ↗(On Diff #416915)

Good catch, fixed.

119–120 ↗(On Diff #416915)

This entire enum is no longer required, so I've removed it all.

703 ↗(On Diff #416915)

I've pushed the 128 part to be provided by the target, but couldn't figure out a nice way to do the splitting entirely on the target. I'll be revisiting this when I have concrete examples of how scalable vectors will work.

804 ↗(On Diff #416915)

Holdover from an early iteration, where I tried to build the whole graph before changing anything. Simplified

820 ↗(On Diff #416915)

Looks to do the job, thanks

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-add.ll
33

These were initially generated from a larger IR, and pushed through llvm-reduce. I've rewritten them manually to be much smaller (basing them off of your snippet).

mgabka added a subscriber: mgabka.Apr 14 2022, 3:25 AM