This is an archive of the discontinued LLVM Phabricator instance.

Differential D114174

[ARM][CodeGen] Add support for complex deinterleaving
ClosedPublic

Authored by NickGuy on Nov 18 2021, 10:29 AM.

Details

Reviewers

dmgreen
SjoerdMeijer
samtebbs

Commits

rGd52e2839f3b1: [ARM][CodeGen] Add support for complex deinterleaving

Summary

Adds the Complex Deinterleaving Pass implementing support for complex numbers in a target-independent manner, deferring to the TargetLowering for the given target to create a target-specific intrinsic.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

There are a very large number of changes, so older changes are hidden.

mgabka added a subscriber: mgabka. Apr 14 2022, 3:25 AM

huntergr added inline comments. May 30 2022, 2:03 AM

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
483	I can't find any place where rotations are dealt with besides assignment and equality checks of 0,90,180,270 -- might an enum be preferable? Or do you anticipate doing arithmetic with them in a follow-up patch?
691	I think it would be worthwhile to add a comment about what the interleaving represents -- that you're looking for a shuffle that takes separate vectors of real and imaginary parts and combines them before they are stored to memory (or returned in registers), and that this is just for matching per-lane operations instead of cross-lane (like a reduction). Or at least that's the behaviour I've observed when testing your patch with a loop like the following:

#define LEN (512)
float _Complex a[ LEN ];
float _Complex b[ LEN ];
float _Complex c[ LEN ];
void foo (void) {
#pragma clang loop vectorize(enable)
  for (int i = 0; i < LEN; ++i)
    a[i] = b[i] * c[i];
}
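As an illustration only (this is not code from the patch), the interleaving described in the comment above can be modelled on plain `std::vector<float>` values. The function names `deinterleave` and `interleave` are hypothetical; the sketch assumes the usual layout in which real and imaginary parts alternate in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits {re0, im0, re1, im1, ...} into separate real and imaginary vectors,
// mirroring shuffle masks <0, 2, 4, ...> and <1, 3, 5, ...>.
std::pair<std::vector<float>, std::vector<float>>
deinterleave(const std::vector<float> &Interleaved) {
  assert(Interleaved.size() % 2 == 0 && "expected real/imaginary pairs");
  std::vector<float> Real, Imag;
  for (std::size_t I = 0; I < Interleaved.size(); I += 2) {
    Real.push_back(Interleaved[I]);
    Imag.push_back(Interleaved[I + 1]);
  }
  return {Real, Imag};
}

// Recombines per-lane results into {re0, im0, re1, im1, ...}, mirroring an
// interleaving mask such as <0, 4, 1, 5, 2, 6, 3, 7>.
std::vector<float> interleave(const std::vector<float> &Real,
                              const std::vector<float> &Imag) {
  assert(Real.size() == Imag.size() && "mismatched lane counts");
  std::vector<float> Out;
  Out.reserve(Real.size() * 2);
  for (std::size_t I = 0; I < Real.size(); ++I) {
    Out.push_back(Real[I]);
    Out.push_back(Imag[I]);
  }
  return Out;
}
```

Every operation between the two steps works lane by lane on the separated vectors, which is the per-lane (not cross-lane) shape the comment refers to.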

Since the last patch, I've redesigned the high-level approach: it now crawls through the instruction graph to find opportunities for complex deinterleaving before attempting to perform said deinterleaving. Doing it this way allows us to short-circuit, bailing out and preventing the heavy work from being performed if something isn't supported.
This iteration also implements support for complex operations with multiple steps/operands (e.g. a[i] * b[i] * c[i]).

Harbormaster completed remote builds in B171067: Diff 438664. Jun 21 2022, 6:25 AM

I haven't really looked too deeply into the meat of the pass yet - how it does the matching - but I had a chunk of comments for the rest.

What happened to the ideas of starting from a shuffle and working through a worklist of pairs of [Real, Imaginary] Values that we match to complex nodes? It would build up a graph of complex nodes to replace the original shuffle, providing that the leaves were all valid de-interleavings. At least on paper I feel we should just be able to perform the search without looking through a lot of uses (except for checking there are no extra uses, of course), and all the Unmatched instructions and what looks like matching of random unrelated snippets would be cleaned up.

llvm/lib/CodeGen/CMakeLists.txt
49	Formatting is off here
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
1–2	This looks like it got formatted incorrectly.
99	item -> Item
100	i -> I (or a better name, if I is already used)
149	Reciprocal Throughput is more common.
511	evaluateComplexDeinterleavingBasicBlock -> evaluateBasicBlock
991–992	It is better to structure things in a way where we decide whether to do something, then do it. Not get half way through doing it and decide we didn't want to in the end. In what ways would we expect createComplexDeinterleavingIR to return nullptr at the moment?
1017	Is this cost necessary at the moment, or will it always be profitable for the simple cases?
llvm/lib/Target/ARM/ARMISelLowering.cpp
21738–21742	Things like this can be a single if: `if (!VTy || VTy->getNumElements() * VTy->getScalarSizeInBits() != 128) return false;` I presume the == 128 can be removed if we teach it how to split the vectors up?
21773	Can this be an assert instead?
llvm/lib/Target/ARM/ARMTargetMachine.cpp
437	This doesn't need to add the new brackets.
llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-mul.ll
32	Add arm_aapcs_vfpcc to any tests that take or return vectors. I think you can remove #0 too.
llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f64-mul.ll
2	It might be a little cleaner to add +fp64 for all these f64 tests.
llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-rotations-add.ll
1 ↗	(On Diff #421520)	MVE tests can go into llvm/test/CodeGen/Thumb2/mve-complex-xyz.ll. So long as they are all updated by the test script, that should be fine to keep them with the rest.

In D114174#3601236, @dmgreen wrote:

What happened to the ideas of starting from a shuffle and working through a worklist of pairs of [Real, Imaginary] Values that we match to complex nodes?

While that would work "simply" enough for cases like a * b, more elaborate cases (e.g. (a * b) * c) would result in some ambiguity as to which add(mul, mul) pairs with which sub(mul, mul). This complexity inflates further the more operands are involved. The approach I've implemented here considers each complex node in isolation (with the exception of the accumulator value).

I presume the == 128 can be removed if we teach it how to split the vectors up?

Yep, vector splitting is something I decided to push out of this initial patch, and will be implemented in a follow-up. (Due to each node being treated in isolation, the vector splitting from previous iterations got overly complicated and unwieldy). The ideal solution that I can see would be to teach the intrinsic how to split, rather than the pass (somewhere like DAGTypeLegalizer::SplitVectorResult).

Is this cost necessary at the moment, or will it always be profitable for the simple cases?

Complex add has a few cases where I've observed performance regressions, and that's the sort of thing this rudimentary cost-modelling is intended to catch.

Remaining comments will be addressed in a follow-up patch.

chill added a subscriber: chill. Jun 28 2022, 3:28 AM

chill added inline comments.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
136	Here and elsewhere `std::find` -> `llvm::find`
593	Why are we scanning the whole block? It looks to me (admittedly I don't understand what this pass does yet) we can just walk over the `Instructions` vector (in reverse, if the order matters), avoiding the quadratic complexity.
633	Here and in a few other places `for (Value *V : Op0->operands()) {`

Do we have some broad overview of the approach and the algorithm? It'd be a good idea to put something like this in the description and eventually the commit message. (I searched for LLVM complex RFCs, but couldn't find anything useful).

chill added inline comments. Jun 28 2022, 7:42 AM

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
325	`return std::make_shared<ComplexDeinterleavingCompositeNode>(Operation);`
561	These could be, e.g.:
static const int RealMask[] = {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30};
auto RealMaskRef = ArrayRef<int>(RealMask, ShufleMask.size());
with an assertion/bounds check. Good enough for 512-bit vectors with 16-bit elements, can be extended.
llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99	Shouldn't these be translated to a couple of `vcadd.f32` instructions, like in the previous test? And this amount of move instructions seems excessive.

chill added inline comments. Jun 28 2022, 7:46 AM

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99	And this amount of move instructions seems excessive. I guess MVE does not have sensible swizzling instructions.

In D114174#3601263, @NickGuy wrote:

In D114174#3601236, @dmgreen wrote:

What happened to the ideas of starting from a shuffle and working through a worklist of pairs of [Real, Imaginary] Values that we match to complex nodes?

While that would work "simply" enough for cases like a * b, more elaborate cases (e.g. (a * b) * c) would result in some ambiguity as to which add(mul, mul) pairs with which sub(mul, mul). This complexity inflates further the more operands are involved. The approach I've implemented here considers each complex node in isolation (with the exception of the accumulator value).

Hmm, OK. I was thinking about a three way multiply when looking at that way of structuring the pass, but only one pattern and only ever on paper. I was worried that there were cases where it was ambiguous, but figured if it was it could always just try both possibilities. But I've not implemented it, it just sounded like an elegant way of treating this like a slightly more complex pattern matcher, as opposed to all this findUnmatchedInstructions and looking through uses.

A broad overview of the algorithm would be useful, like Momchil mentioned, perhaps in the file description. I was trying to make a sketch before but got a bit lost in nested loops.

I presume the == 128 can be removed if we teach it how to split the vectors up?

Yep, vector splitting is something I decided to push out of this initial patch, and will be implemented in a follow-up. (Due to each node being treated in isolation, the vector splitting from previous iterations got overly complicated and unwieldy). The ideal solution that I can see would be to teach the intrinsic how to split, rather than the pass (somewhere like DAGTypeLegalizer::SplitVectorResult).

A separate addition sounds good. Let's try to get something that we are happy with and extend it out.

Is this cost necessary at the moment, or will it always be profitable for the simple cases?

Complex add has a few cases where I've observed performance regressions, and that's the sort of thing this rudimentary cost-modelling is intended to catch.

Do you have a test-case where this happens? Cost modelling will probably be good to add in the long run, I'd just like to understand where it currently can make things worse.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99	Yeah this is expected from shuffles that MVE cannot handle very well. It would look a lot better either with loads that could become interleaving loads, or under AArch64 where better shuffles are available.

Addressed comments, and added a file description that attempts to explain what the pass does.

NickGuy added a child revision: D129066: [AArch64][CodeGen] Add AArch64 support for complex deinterleaving. Jul 4 2022, 2:56 AM

NickGuy added a child revision: D129067: [CodeGen] Complex deinterleaving vector splitting. Jul 4 2022, 2:58 AM

Harbormaster completed remote builds in B173522: Diff 442045. Jul 4 2022, 3:20 AM

NickGuy marked 15 inline comments as done. Jul 4 2022, 3:42 AM

NickGuy added inline comments.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
593	The order of the `CompositeNode`s is important, this was a holdover from before we were sorting the nodes.
llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99	"Shouldn't these be translated to a couple of vcadd.f32 instructions, like in the previous test?" Not in this case, no. The IR vectors are too wide to fit into actual vector registers, and the follow-up vector splitting patch (D129067) is restricted to splitting in half.

Some of the comments looked like they didn't get resolved, and the latest version is missing context.

It would be good to get to the point where the tests are in a good place and we can commit those, allowing us to show just the differences here. They need to be in the right place first though.

NickGuy updated this revision to Diff 449946. Aug 4 2022, 6:00 AM

Harbormaster completed remote builds in B179253: Diff 449946. Aug 4 2022, 6:00 AM

I've pre-committed the tests, and redesigned the core algorithm to operate on pairs of operands, rather than analysing the uses of a given instruction. The vector splitting has also been implemented on the target side.

The tests are good, but they currently only cover each of the types for vcadd and vcmla0/90. I think we need more to show all the edge cases that should or should not be matched. They are not needed for all types, but we should try and make examples of each.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
163	ConvergingI is the Root of the graph? Perhaps just call it RootValue.
529	What if the operands are FMul(A, A) and FMul (B, C)? I think it needs to be more precise about which nodes are which.
564	I don't think it makes sense to treat these backwards. We know the CR part should be real and the CI is imaginary. It would only be valid to treat them the other way around if there was some sort of shuffle added.
593	Does this need to be calling identifyNode on the third argument too? The one derived from UncommonValues.
656	This looks like it needs to match:
- RealI and ImagI are both shuffles.
- They both have the same Operand0.
- They both have "deinterleaving" masks.
I don't think the type of the value of Operand0 otherwise matters. It doesn't matter if it is a Load or an Argument, we can always just use it.
749	It's best not to use static data like this, we can make it more generic. The match can be `m_Shuffle(..., m_Mask(Mask))`, then check that the Mask is an isDeinterleavingMask. It does need to check _which_ deinterleaving mask it is though for the Real/Imaginary parts.
854	If the Operands that CN depends on are included in the Node, then we can just walk up the tree making sure we create the Operands before the Nodes that use them, using the Value's that the operands produce as the Input0/Input1/Accumulator below. That avoids the need to "wrangle" any inputs at this stage, as we already know the nodes we need.
869	It might be simpler to remove anything Cost based from this revision, adding it back in if it is needed in followup patches. All the identification that happens in this patch should always be cheaper, as far as I understand.
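The mask check asked for in the comments on lines 656 and 749 could look roughly like the sketch below. It is a standalone model, not code from the patch: `getDeinterleavingOffset` is a hypothetical helper, and the mask is a plain `std::vector<int>` rather than an LLVM `ArrayRef<int>`. It reports which half the mask selects instead of comparing against a fixed-size static table.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns 0 if Mask picks the even (real) lanes, 1 if it picks the odd
// (imaginary) lanes, and -1 if it is not a factor-2 deinterleaving mask.
int getDeinterleavingOffset(const std::vector<int> &Mask) {
  if (Mask.empty() || (Mask[0] != 0 && Mask[0] != 1))
    return -1;
  for (std::size_t I = 1; I < Mask.size(); ++I)
    if (Mask[I] != Mask[0] + 2 * static_cast<int>(I))
      return -1;
  return Mask[0];
}
```

A matcher built on this would accept a real/imaginary pair only when the first shuffle yields offset 0, the second yields offset 1, and both read the same operand.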

Addressed comments and improved case coverage.

Harbormaster completed remote builds in B185092: Diff 457993. Sep 5 2022, 7:33 AM

Some of this seems to have returned to a previous version of the patch?

llvm/lib/CodeGen/CMakeLists.txt
49	This formatting again?
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
432	Similar to identifyPartialMul, this probably needs to be more careful about what it is selecting as which operands.
529	I still think this needs to be more precise with regard to what is considered the CommonOperand and CommonOperand.
569–571	Is it valid to test the operands in the opposite order?
624	Should these be checking the subnodes? if (!identifyNode(..)) return nullptr;
647	CommonOperandI only seems to be important between identifyPartialMul and identifyNodeWithImplicitAdd
680–681	These need to check that the first is the real deinterleave with offset=0, and the imag has offset=1. And maybe that they only take elements from the first operand and don't change size.
724–732	I feel that this should be correct by construction. What cases are not correct?
791–793	I don't think this should be needed. The inputs should just be present from the ReplacementNode of the operands.
811	If the statistic is being awkward then it is probably not worth keeping (or keep it simple: just count the number of transforms, not the number of individual intrinsics they might become in the backend).
llvm/lib/Target/ARM/ARMISelLowering.cpp
21776	It is better not to use fixed sized arrays like this, just construct the array to be any needed size. If the values are continuous you can use iota.
llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-add.ll
3	These tests seem to have moved back to the wrong place.
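For the `iota` suggestion on ARMISelLowering.cpp line 21776, a minimal sketch of building a contiguous mask of any size is shown below. The helper name `makeSequentialMask` is hypothetical, and `std::vector` stands in for whatever container the patch uses (a `SmallVector` would be the usual LLVM choice).

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Builds the mask {Start, Start + 1, ..., Start + Size - 1} for any Size,
// avoiding a fixed-size array.
std::vector<int> makeSequentialMask(int Start, unsigned Size) {
  std::vector<int> Mask(Size);
  std::iota(Mask.begin(), Mask.end(), Start);
  return Mask;
}
```

For example, `makeSequentialMask(0, 4)` and `makeSequentialMask(4, 4)` give the lower and upper halves of an 8-lane vector.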

Addressed comments

Harbormaster completed remote builds in B185432: Diff 458481. Sep 7 2022, 9:41 AM

OK great. It would be good to see a lot of test cases for pretty much anything that might go wrong. Things like the shuffles with incorrect masks, fadd's where there should be fsubs, patterns that cross basic blocks, any of the conditions that can currently return false that we can test.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
128	Can we just use nullptr as the sentinel value for IsValid = false, removing the need for this struct? If we need some way to represent the last shuffle node then we could add a Shuffle or Leaf node type (which could itself have a ReplacementNode just set to the shuffle->operand(0)). Hopefully that also simplifies resolveInputs / Elevate. Especially if Accumulator is treated in the same way as an Operand. Hopefully that is simpler.
438	Should this function be calling identifyNode on the inputs, similar to identifyPartialMul?
473	When can this be false?
590	I think I was expecting Operands to be the ComplexDeinterleavingCompositeNode* from identifyNode. That avoids the need to find them again later.
717	These seem to do recursive checks into the operands, but that is already being done again in identifyNode. Can we just remove them and rely on identifyNode?
776–777	This isn't used any more, which is good (I think there might be a better way to get a TTI if it was needed).
787–788	When can this return false?

Forgot to submit the comments adjoining the patch...

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
569–571	No, but it's also not invalid (the checks do what they're supposed to and don't produce a node). I've removed these.
647	It used to be needed all through, but that is no longer the case. Changed the signatures and such to reflect this.
724–732	The one that is easiest to see is when attempting to multiply by a value rotated by 270 (`a[i] * (b[i] * III)`). It produces the following IR, which has the real and imaginary components reversed at %interleaved.vec:
%a.real = shufflevector <8 x float> %a, <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%a.imag = shufflevector <8 x float> %a, <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
%b.real = shufflevector <8 x float> %b, <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%b.imag = shufflevector <8 x float> %b, <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
%0 = fmul fast <4 x float> %b.imag, %a.imag
%1 = fmul fast <4 x float> %b.real, %a.real
%2 = fsub fast <4 x float> %0, %1
%3 = fmul fast <4 x float> %b.imag, %a.real
%4 = fmul fast <4 x float> %b.real, %a.imag
%5 = fadd fast <4 x float> %4, %3
%interleaved.vec = shufflevector <4 x float> %5, <4 x float> %2, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
787–788	It can be false if a shuffle hasn't been traversed properly. In hindsight though, this could be an assertion instead as it indicates a problem with the node construction.
791–793	It's more of a side effect of analysing the whole tree before actually performing the replacement (which we do in case the tree can't be transformed). With some of the recent changes however, this function can't return false so it's merely a step of the replacement now.
811	Changed to track the transforms rather than individual intrinsics.
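To make the arithmetic in the 724–732 reply easier to follow, the sketch below models one lane of that IR with scalars and compares it with `a * b * (-i)`, taking "rotated by 270" to mean multiplication by `-i`. The function name `laneFromIR` is hypothetical and the temporaries are named after the IR values they mirror. It only checks the arithmetic; it does not settle whether the pass should turn this pattern into complex intrinsics.

```cpp
#include <cassert>
#include <complex>

// Scalar model of one lane of the IR above. The result holds the value that
// lands in the real position (%5) and in the imaginary position (%2) of
// %interleaved.vec.
std::complex<float> laneFromIR(std::complex<float> A, std::complex<float> B) {
  float T0 = B.imag() * A.imag(); // %0
  float T1 = B.real() * A.real(); // %1
  float T2 = T0 - T1;             // %2 = -(real part of a*b)
  float T3 = B.imag() * A.real(); // %3
  float T4 = B.real() * A.imag(); // %4
  float T5 = T4 + T3;             // %5 = imaginary part of a*b
  return {T5, T2};
}
```

With a = 1 + 2i and b = 3 + 4i, a * b is -5 + 10i, and the lane holds 10 + 5i, which is (-5 + 10i) * (-i): the imaginary part of the plain product ends up in the real position and the negated real part in the imaginary position.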

NickGuy updated this revision to Diff 459432. Sep 12 2022, 5:58 AM

NickGuy marked 7 inline comments as done.

NickGuy added inline comments.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
438	Not here, no. The inputs need to be resolved as part of the whole pair, so the relevant one for the created node is passed back via CommonOperandI.
473	In its current state, never. It was possible before moving to the implicitAdd approach.
776–777	Good catch, removed.

NickGuy added inline comments. Sep 12 2022, 5:58 AM

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
590	Attempting to plumb something together that accepts both `ComplexDeinterleavingCompositeNode` and `Value` caused more bloat and made things more difficult to follow. And we can't simply use `Node->ReplacementNode` as the `Value*` because it hasn't been assigned yet. Because of that constraint, I'm opting to find the nodes later, though I'm open to ideas.

Harbormaster completed remote builds in B186131: Diff 459432. Sep 12 2022, 5:59 AM

dmgreen added inline comments. Sep 13 2022, 2:22 AM

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
169	Why is this needed? When are the Real and Imag values not already correct? It seems strange to be able to reverse them.
438	What about the other pair? They are not guaranteed to be the same as the operands from identifyPartialMul as far as I can see. I would probably make sure that the two operands are a valid match too. The second call to identifyNode with the same operands should just find the same node again. And that way this can have proper operands. Currently this and identifyPartialMul are very much interlinked. I hope we can improve that in the future, but it sounds like an extension that can be thought about later.
590	I was expecting it to just accept ComplexDeinterleavingCompositeNode. If all the operands are ComplexDeinterleavingCompositeNode, then they just need to be visited in the correct order to assign ReplacementNode based on how they need to be transformed. (Which I believe is fine providing we visit them in reverse order, like is already done). i.e. we construct a graph made up of ComplexDeinterleavingCompositeNode, then transform that graph. That seems like a simpler, more extensible design going forward. It gets more complicated if there is a mixing of values between the original IR and the newly constructed intrinsics. I could imagine that might make graphs (as opposed to DAGs) more difficult, but we don't yet support any cycles.
724–732	Sorry - I'm not sure what this was referring to now! Do you think that example should be transforming into complex intrinsics? It doesn't look valid to me. Have you added it as a test case?
737–738	Can we move this into identifyNodes (or maybe at this point there will always be nodes?) That would make replaceNodes always return true, which simplifies the getDeadRoots a little too.
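The "construct a graph of nodes, then transform that graph" design from the comment on line 590 can be sketched in isolation as below. This is a simplified model under stated assumptions: `Node` is a hypothetical stand-in for `ComplexDeinterleavingCompositeNode`, and a string stands in for the `Value*` held in `ReplacementNode`. The only point shown is that each operand gets its replacement before any node that uses it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for ComplexDeinterleavingCompositeNode.
struct Node {
  std::string Name;
  std::vector<Node *> Operands;
  std::string Replacement; // empty until the node has been transformed
};

// Assigns a replacement to N only after all of its operands have one, so a
// node can consume its operands' replacements directly.
void replaceNode(Node &N) {
  if (!N.Replacement.empty())
    return; // already transformed, e.g. an operand shared by two users
  std::string Args;
  for (Node *Op : N.Operands) {
    replaceNode(*Op);
    if (!Args.empty())
      Args += ", ";
    Args += Op->Replacement;
  }
  N.Replacement = N.Operands.empty() ? N.Name : N.Name + "(" + Args + ")";
}
```

Because every operand is itself a node, no original IR values need to be looked up or "wrangled" at replacement time. The sketch assumes an acyclic graph, matching the remark that cycles are not yet supported.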

Addressed comments, and redesigned the lookup list of nodes to now be represented as a graph.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
169	It was a holdover function that I reused for the `.Valid` check. It's not used anywhere else so I've removed it (and the relevant struct), and moved the relevant check to here.
590	I've converted the pseudo-graph (a list with lookups) to an actual graph structure, with every node being a ComplexDeinterleavingCompositeNode, meaning that we're no longer mixing `Value`s and `ComplexDeinterleavingCompositeNode*`s.

Harbormaster completed remote builds in B187920: Diff 461829. Sep 21 2022, 2:52 AM

Have you had any luck adding a larger corpus of test cases yet? I see that uses have been added to the graph. I would hope that we could avoid that complication, but it is hard to see if we can or not without a larger set of tests.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
762	Try not to overuse auto where the type isn't already obvious.

Added more interesting test cases, covering triangle and diamond multiply cases.

Harbormaster completed remote builds in B189165: Diff 463541. Sep 28 2022, 7:04 AM

Thanks for adding all the extra tests. Here is another one, that is useful in itself, but more useful when we break it. It does a*b +90 a*c, and seems to work well for this example.

define arm_aapcs_vfpcc <4 x float> @test(<4 x float> %a, <4 x float> %b, <4 x float> %c) {
entry:
  %ar = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %ai = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %br = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %bi = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %cr = shufflevector <4 x float> %c, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %ci = shufflevector <4 x float> %c, <4 x float> poison, <2 x i32> <i32 1, i32 3>

  %i6 = fmul fast <2 x float> %br, %ar
  %i7 = fmul fast <2 x float> %bi, %ai
  %xr = fsub fast <2 x float> %i6, %i7
  %i9 = fmul fast <2 x float> %bi, %ar
  %i10 = fmul fast <2 x float> %br, %ai
  %xi = fadd fast <2 x float> %i9, %i10

  %j6 = fmul fast <2 x float> %cr, %ar
  %j7 = fmul fast <2 x float> %ci, %ai
  %yr = fsub fast <2 x float> %j6, %j7
  %j9 = fmul fast <2 x float> %ci, %ar
  %j10 = fmul fast <2 x float> %cr, %ai
  %yi = fadd fast <2 x float> %j9, %j10

  %zr = fsub fast <2 x float> %yr, %xi
  %zi = fadd fast <2 x float> %yi, %xr
  %interleaved.vec = shufflevector <2 x float> %zr, <2 x float> %zi, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
  ret <4 x float> %interleaved.vec
}

But this is a modification that alters the c*a to use part of the b*a mul's. I don't believe it should be transformed as it is.

define arm_aapcs_vfpcc <4 x float> @mul_triangle_addmul(<4 x float> %a, <4 x float> %b, <4 x float> %c) {
entry:
  %ar = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %ai = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %br = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %bi = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %cr = shufflevector <4 x float> %c, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %ci = shufflevector <4 x float> %c, <4 x float> poison, <2 x i32> <i32 1, i32 3>

  %i6 = fmul fast <2 x float> %br, %ar
  %i7 = fmul fast <2 x float> %bi, %ai
  %xr = fsub fast <2 x float> %i6, %i7
  %i9 = fmul fast <2 x float> %bi, %ar
  %i10 = fmul fast <2 x float> %br, %ai
  %xi = fadd fast <2 x float> %i9, %i10

  ;%j6 = fmul fast <2 x float> %cr, %ar
  %j7 = fmul fast <2 x float> %ci, %ai
  %yr = fsub fast <2 x float> %i6, %j7
  ;%j9 = fmul fast <2 x float> %ci, %ar
  %j10 = fmul fast <2 x float> %cr, %ai
  %yi = fadd fast <2 x float> %i9, %j10

  %zr = fsub fast <2 x float> %yr, %xi
  %zi = fadd fast <2 x float> %yi, %xr
  %interleaved.vec = shufflevector <2 x float> %zr, <2 x float> %zi, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
  ret <4 x float> %interleaved.vec
}

The Incomplete nodes worry me as it looks like a rich source of bugs. If the identifyPartialMul and identifyNodeWithImplicitAdd need to work together more closely for the time being, that is probably fine. We can always change it in the future if needed.
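A scalar model of one lane of the two functions above (@test and @mul_triangle_addmul) shows why the second should not be matched as the first. This is an illustrative sketch, not part of the patch; `laneTest` and `laneTriangle` are hypothetical names, and the temporaries follow the IR value names.

```cpp
#include <cassert>
#include <complex>

using Cplx = std::complex<float>;

// One lane of @test: x = a*b, y = a*c, result = y + i*x.
Cplx laneTest(Cplx A, Cplx B, Cplx C) {
  float I6 = B.real() * A.real(), I7 = B.imag() * A.imag();
  float I9 = B.imag() * A.real(), I10 = B.real() * A.imag();
  float Xr = I6 - I7, Xi = I9 + I10;
  float J6 = C.real() * A.real(), J7 = C.imag() * A.imag();
  float J9 = C.imag() * A.real(), J10 = C.real() * A.imag();
  float Yr = J6 - J7, Yi = J9 + J10;
  return {Yr - Xi, Yi + Xr}; // %zr, %zi
}

// One lane of @mul_triangle_addmul: %yr and %yi reuse %i6 and %i9 from the
// a*b multiply in place of %j6 and %j9.
Cplx laneTriangle(Cplx A, Cplx B, Cplx C) {
  float I6 = B.real() * A.real(), I7 = B.imag() * A.imag();
  float I9 = B.imag() * A.real(), I10 = B.real() * A.imag();
  float Xr = I6 - I7, Xi = I9 + I10;
  float J7 = C.imag() * A.imag(), J10 = C.real() * A.imag();
  float Yr = I6 - J7, Yi = I9 + J10;
  return {Yr - Xi, Yi + Xr}; // %zr, %zi
}
```

For a = 1 + 2i, b = 3 + 4i, c = 5 + 6i, the first gives -17 + 11i, which equals a*c + i*(a*b), while the second gives -19 + 9i. Emitting the same complex multiply intrinsics for both would therefore change the result of the second function.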

Some other issues are that we need to check for multiple uses. As in something like

define arm_aapcs_vfpcc <4 x float> @mul_triangle_multiuses(<4 x float> %a, <4 x float> %b, ptr %p) {
; CHECK-LABEL: mul_triangle_multiuses:
; CHECK:       @ %bb.0: @ %entry
; CHECK-NEXT:    .vsave {d14}
; CHECK-NEXT:    vpush {d14}
; CHECK-NEXT:    .vsave {d10, d11, d12}
; CHECK-NEXT:    vpush {d10, d11, d12}
; CHECK-NEXT:    .vsave {d8}
; CHECK-NEXT:    vpush {d8}
; CHECK-NEXT:    vmov q2, q0
; CHECK-NEXT:    vmov.f32 s16, s4
; CHECK-NEXT:    vmov.f32 s17, s6
; CHECK-NEXT:    vmov.i32 q0, #0x0
; CHECK-NEXT:    vmov.f32 s20, s9
; CHECK-NEXT:    vmov.f32 s21, s11
; CHECK-NEXT:    vmov.f32 s28, s5
; CHECK-NEXT:    vmul.f32 q3, q5, q4
; CHECK-NEXT:    vmov.f32 s29, s7
; CHECK-NEXT:    vmul.f32 q5, q7, q5
; CHECK-NEXT:    vmov.f32 s24, s8
; CHECK-NEXT:    vmov.f32 s25, s10
; CHECK-NEXT:    vneg.f32 q5, q5
; CHECK-NEXT:    vfma.f32 q3, q7, q6
; CHECK-NEXT:    vfma.f32 q5, q4, q6
; CHECK-NEXT:    vmov.f32 s22, s12
; CHECK-NEXT:    vmov.f32 s23, s13
; CHECK-NEXT:    vmov q3, q0
; CHECK-NEXT:    vcmla.f32 q3, q1, q2, #0
; CHECK-NEXT:    vstrw.32 q5, [r0]
; CHECK-NEXT:    vcmla.f32 q3, q1, q2, #90
; CHECK-NEXT:    vcmla.f32 q0, q2, q3, #0
; CHECK-NEXT:    vcmla.f32 q0, q2, q3, #90
; CHECK-NEXT:    vpop {d8}
; CHECK-NEXT:    vpop {d10, d11, d12}
; CHECK-NEXT:    vpop {d14}
; CHECK-NEXT:    bx lr
entry:
  %strided.vec = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %strided.vec35 = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %strided.vec37 = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %strided.vec38 = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %0 = fmul fast <2 x float> %strided.vec37, %strided.vec
  %1 = fmul fast <2 x float> %strided.vec38, %strided.vec35
  %2 = fsub fast <2 x float> %0, %1
  %3 = fmul fast <2 x float> %2, %strided.vec35
  %4 = fmul fast <2 x float> %strided.vec38, %strided.vec
  %5 = fmul fast <2 x float> %strided.vec35, %strided.vec37
  %6 = fadd fast <2 x float> %4, %5
  %otheruse = shufflevector <2 x float> %2, <2 x float> %6, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  store <4 x float> %otheruse, ptr %p
  %7 = fmul fast <2 x float> %6, %strided.vec
  %8 = fadd fast <2 x float> %3, %7
  %9 = fmul fast <2 x float> %2, %strided.vec
  %10 = fmul fast <2 x float> %6, %strided.vec35
  %11 = fsub fast <2 x float> %9, %10
  %interleaved.vec = shufflevector <2 x float> %11, <2 x float> %8, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
  ret <4 x float> %interleaved.vec
}

And we probably need to check for fast-math flags, where we are generating fma. I also think that it is safer if nodes are uniquely identified from {real, imag} root pairs, not just from nodes that might contain either real or imag somewhere in them.

llvm/test/CodeGen/Thumb2/complex-deinterleaving-uniform-cases.ll
1 ↗	(On Diff #463541)	All the mve tests start with mve-

Thanks @dmgreen for the extra tests.

In D114174#3826540, @dmgreen wrote:

...
The Incomplete nodes worry me as it looks like a rich source of bugs. If the identifyPartialMul and identifyNodeWithImplicitAdd need to work together more closely for the time being, that is probably fine. We can always change it in the future if needed.
...

I've removed the late lookup/replacement of incomplete nodes, and moved it closer to where the incomplete nodes are created. I've also made it only check against a single node, one that we know should match.

...
And we probably need to check for fast-math flags, where we are generating fma. I also think that it is safer if nodes are uniquely identified from {real, imag} root pairs, not just from nodes that might contain either real or imag somewhere in them.
...

Done, we're now checking the fast-math flags (specifically contract).

Harbormaster completed remote builds in B191749: Diff 467159. Oct 12 2022, 8:38 AM

dmgreen added inline comments. Oct 14 2022, 10:54 AM

llvm/include/llvm/CodeGen/ComplexDeinterleavingPass.h
37	I don't think None is ever used, and from what I can see _Placeholder is a Shuffle? That might be a better name, either Shuffle or Leaf. I would also probably remove _Incomplete. They might make sense, but it is difficult to follow and very easy to get wrong.
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
106	I think it would be best to store the root Real and Imag values as part of the ComplexDeinterleavingCompositeNode. That way we can be sure we are matching a correct node from anything like getContainingComposite. There can then be the TransientInstructions (maybe "ExtraInstructions"), which are also part of the pattern, but not the root nodes.
115	Could AccumulatorNode just be another Operand?
116	I think it would be best to avoid NodePtr, if that might keep a reference to a node alive. Avoiding shared_ptr entirely would be good, but I imagine any alternative might be difficult too. They need to be stable values to avoid memory invalidation issues.
171	Formatting needs updating.
220	If NodeInstructions is used to calculate all the Instructions we have seen, to check that all the uses are inside the graph, it could be a SmallPtrSet which will have quicker contains().
288	It is hard to see how wrapping a Value as an _Incomplete node is always correct. It will have lost the information about whether V was real or imaginary.
576	Do you have any tests for fneg? This should probably be more like the code in identifyPartialMul where the rotation is chosen based on whether the Real/Imag parts are negated.
673	It's not the combination of rotates that is invalid really, although some might make less sense than others, and some might simplify to code that is difficult to match. The even rotations will define the real part of `a`, the odd rotations define the imag part. Between the two parts of the multiply we match we need to find both the real and imag halves of the value, to successfully match it further. (We may be able to do something with only half a match in some cases, like matching leafs, but that should probably be left for a later patch).
llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-mixed-cases.ll
79 ↗	(On Diff #467159)	Are you sure this is valid? When we were doing the gather lowering, we found it useful to annotate the tests with OK/Bad, so if they change we could see the ones we thought shouldn't.
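
The point about even and odd rotations (the comment on line 673 above) can be illustrated with a small scalar model. This is only a sketch: the `Complex` struct and the `partialMul` and `fullMul` names are hypothetical and do not come from the patch, and the per-rotation arithmetic assumes the usual Arm complex multiply-accumulate convention. It shows that rotations 0 and 180 read only the real part of `a`, that rotations 90 and 270 read only the imaginary part, and that one of each is needed to form a full product.

```cpp
#include <cassert>

// Hypothetical scalar model; none of these names appear in the patch.
struct Complex {
  float Re;
  float Im;
};

// Accumulates one half of A * B into Acc, selected by Rotation (in degrees).
// Rotations 0 and 180 read only A.Re; rotations 90 and 270 read only A.Im.
// Assumption: this follows the usual Arm complex multiply-accumulate rules.
Complex partialMul(Complex Acc, Complex A, Complex B, unsigned Rotation) {
  switch (Rotation) {
  case 0:
    return {Acc.Re + A.Re * B.Re, Acc.Im + A.Re * B.Im};
  case 90:
    return {Acc.Re - A.Im * B.Im, Acc.Im + A.Im * B.Re};
  case 180:
    return {Acc.Re - A.Re * B.Re, Acc.Im - A.Re * B.Im};
  default: // 270
    return {Acc.Re + A.Im * B.Im, Acc.Im - A.Im * B.Re};
  }
}

// A full multiply pairs an even rotation with an odd one, so that both the
// real and the imaginary half of A are consumed.
Complex fullMul(Complex A, Complex B) {
  Complex Acc = {0.0f, 0.0f};
  Acc = partialMul(Acc, A, B, 0);
  return partialMul(Acc, A, B, 90);
}
```

With only one of the two rotations matched, half of `a` is never read, which is why a lone partial multiply cannot be matched further on its own.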

Addressed comments

Harbormaster completed remote builds in B193506: Diff 469583.Oct 21 2022, 6:47 AM

Thanks for the updates. This is looking good now.

There are some extra comments below. I was trying some examples that have awkward orders and cross basicblocks, but couldn't find ways to make it break.

llvm/include/llvm/CodeGen/ComplexDeinterleavingPass.h
40–43	This can now be re-flowed.
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
20	Can this comment be updated now?
560	These can be // comments. They are very useful, but maybe it is not necessary to repeat them.
llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-mixed-cases.ll
4 ↗	(On Diff #469583)	Are there any tests with missing fast-math flags?
5 ↗	(On Diff #469583)	I managed to come up with this testcase that failed because of the vector size. It is probably larger than it needs to be. It could work or not, depending on whether the backend handles non-power-2 sizes. Either way, it would be good to add this example. (Once integers are added too, some odd-size integer width tests would be good too.)
	define arm_aapcs_vfpcc <12 x float> @abp90c12(<12 x float> %a, <12 x float> %b, <12 x float> %c) {
	entry:
	  %ar = shufflevector <12 x float> %a, <12 x float> poison, <6 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10>
	  %ai = shufflevector <12 x float> %a, <12 x float> poison, <6 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11>
	  %br = shufflevector <12 x float> %b, <12 x float> poison, <6 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10>
	  %bi = shufflevector <12 x float> %b, <12 x float> poison, <6 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11>
	  %cr = shufflevector <12 x float> %c, <12 x float> poison, <6 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10>
	  %ci = shufflevector <12 x float> %c, <12 x float> poison, <6 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11>
	  %i6 = fmul fast <6 x float> %br, %ar
	  %i7 = fmul fast <6 x float> %bi, %ai
	  %xr = fsub fast <6 x float> %i6, %i7
	  %i9 = fmul fast <6 x float> %bi, %ar
	  %i10 = fmul fast <6 x float> %br, %ai
	  %xi = fadd fast <6 x float> %i9, %i10
	  %zr = fsub fast <6 x float> %cr, %xi
	  %zi = fadd fast <6 x float> %ci, %xr
	  %interleaved.vec = shufflevector <6 x float> %zr, <6 x float> %zi, <12 x i32> <i32 0, i32 6, i32 1, i32 7, i32 2, i32 8, i32 3, i32 9, i32 4, i32 10, i32 5, i32 11>
	  ret <12 x float> %interleaved.vec
	}
llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-uniform-cases.ll
79 ↗	(On Diff #469583)	Is this just not transforming because of commutativity on the add? Can we add a test for `fadd fast <2 x float> %strided.vec20, %strided.vec` too, to show it does transform.
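
The shuffle masks in the 12-element test case above follow a simple pattern, sketched below in standalone C++. `deinterleavingMask` mirrors the `createDeinterleavingMask` helper in the patch, but uses `std::vector` so it compiles without LLVM. `interleavingMask` and `isPowerOfTwo` are hypothetical helpers added for illustration. The actual backend check is not shown in this thread, so applying the power-of-two test to the element count is an assumption.

```cpp
#include <cassert>
#include <vector>

// Mask selecting every second lane, starting at Offset (0 = real, 1 = imag).
std::vector<int> deinterleavingMask(int Len, int Offset) {
  std::vector<int> Mask(Len);
  for (int J = 0; J < Len; ++J)
    Mask[J] = J * 2 + Offset;
  return Mask;
}

// Hypothetical helper: mask that zips two Len-lane halves back together,
// giving <0, Len, 1, Len + 1, ...>.
std::vector<int> interleavingMask(int Len) {
  std::vector<int> Mask;
  for (int J = 0; J < Len; ++J) {
    Mask.push_back(J);
    Mask.push_back(J + Len);
  }
  return Mask;
}

// Hypothetical stand-in for the backend's width check.
bool isPowerOfTwo(unsigned N) { return N != 0 && (N & (N - 1)) == 0; }
```

For the `<12 x float>` case, the deinterleaving masks select six lanes each, and twelve lanes is not a power of two, so a check of this kind would reject the transformation.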

Came here to say that this looks nice and would be worthy of a mention in the release notes, so that can be added, I think.

Then I started reading the patch again a bit and couldn't help myself writing down some nits.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
17	Nit: perhaps just omit `in parallel`? Or if you want to keep it, it could benefit from a clarification.
26	I wouldn't mind a few more words/sentences about the internal datastructures, the graph, nodes, etc.
51	Nit: remove `deinterleaving` from this description?
60	Nit: newline
110	Can you say what these instructions could be?
174	I haven't looked carefully, but why do we also need to record all instructions here? I mean, the graph can be queried for existence of nodes/instructions, or is this bookkeeping duplication more efficient?
182	Nit: no validation going on here, only insertion?
240	Nit: newline?
313	Nit: can omit the comparison with 1.
344	Nit: I am a fan of reducing indentation, e.g.:
	  auto *SVI = dyn_cast<ShuffleVectorInst>(&I);
	  if (!SVI)
	    continue;
	  if (!isInterleavingMask(SVI->getShuffleMask()))
	    continue;
	etc.
368	Nit: maybe I overlooked something, but this doesn't seem to use any state from the class so could be just a helper function?
468	Nit: I am wondering if these casts here and below are necessary.
484	nit: maybe some or all of these node setting things can be done in a constructor.
540	Nits: Op -> Operand, not Instruction -> not a Instruction?

NickGuy updated this revision to Diff 473955.Nov 8 2022, 4:10 AM

NickGuy marked 23 inline comments as done.

NickGuy added inline comments.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
174	The duplication makes it simpler to check the internal uses of a node (See `ComplexDeinterleavingCompositeNode::hasAllInternalUses`). Without it, we'd need to iterate over the nodes twice, and check the instructions of each node (All of `Real`, `Imag`, and the contents of `InternalInstructions`), while this approach allows us to simply say "Is this instruction known within the graph?".
182	Good catch, the validation that was here was moved to be more tied into the identification. Renamed
368	It populates the compositeNode structure through (formerly) `validateAndSubmitCompositeNode`, and also calls `identifyNode` further, which does use state from the class.
468	You are correct, good catch.
484	Not sure I agree with that, there are cases - such as a `Shuffle` - where none of the other settings are relevant. The operation type, real instruction, and imaginary instructions are the only settings relevant to all cases.
llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-mixed-cases.ll
5 ↗	(On Diff #469583)	Fixed this case (by checking on the backend whether the vector width is a power of 2), and added the test.
llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-uniform-cases.ll
79 ↗	(On Diff #469583)	Yep, it's the commutativity. I've added a note to this test stating this, and added another test with the fadd operands reversed.
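
The bookkeeping discussed for line 174 (one graph-wide record of every instruction, so that "Is this instruction known within the graph?" is a single lookup) can be sketched as follows. This is a simplified model, not the code from the patch: `Inst` is a hypothetical stand-in for `llvm::Instruction`, `std::unordered_set` stands in for the `SmallPtrSet` suggested earlier in the review, and the real `hasAllInternalUses` may differ.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for llvm::Instruction; it only tracks its users.
struct Inst {
  std::vector<const Inst *> Users;
};

// Graph-wide record of every instruction seen so far.
using InstSet = std::unordered_set<const Inst *>;

// True when every user of every instruction in Node is itself in Known,
// i.e. no value computed inside the node escapes the graph.
bool hasAllInternalUses(const std::vector<const Inst *> &Node,
                        const InstSet &Known) {
  for (const Inst *I : Node)
    for (const Inst *U : I->Users)
      if (Known.count(U) == 0)
        return false;
  return true;
}
```

Without the shared set, each user would have to be compared against the instructions of every node in turn, which is the double iteration the reply describes.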

Harbormaster completed remote builds in B196683: Diff 473955.Nov 8 2022, 4:10 AM

NickGuy mentioned this in D129066: [AArch64][CodeGen] Add AArch64 support for complex deinterleaving.Nov 8 2022, 5:40 AM

dmgreen added inline comments.Nov 10 2022, 9:19 AM

llvm/include/llvm/InitializePasses.h
106–110	You can undo these
118–121	There is a newline here?
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
413	If you can turn some of these into better sentences that would be good. Perhaps "Mul instruction has multiple uses" like you have below.
482	Formatting is a bit long.
llvm/lib/Target/ARM/ARMTargetMachine.cpp
437	This again.
llvm/test/CodeGen/ARM/O3-pipeline.ll
2	This looks like it was incorrectly added for this test.

khchen added a subscriber: khchen.Nov 10 2022, 10:06 AM

NickGuy updated this revision to Diff 474743.Nov 11 2022, 6:39 AM

NickGuy marked 6 inline comments as done.

Harbormaster completed remote builds in B197217: Diff 474743.Nov 11 2022, 6:39 AM

OK great. Thanks for the updates. I think this looks to be in a good state now. We can always do more nitpicking, but it looks to be in a good state to get into tree. We can get this in now and add AArch64 support on top to increase the test coverage.

Some minor comments below but otherwise LGTM.

llvm/include/llvm/InitializePasses.h
117–118	You can undo these.
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
591	Formatting.
llvm/lib/Target/ARM/ARMISelLowering.cpp
21754	Formatting.
21762	Add a message to the assert

This revision is now accepted and ready to land.Nov 14 2022, 5:39 AM

This revision was landed with ongoing or failed builds.Nov 14 2022, 6:04 AM

Closed by commit rGd52e2839f3b1: [ARM][CodeGen] Add support for complex deinterleaving (authored by NickGuy).

This revision was automatically updated to reflect the committed changes.

NickGuy marked 4 inline comments as done.

NickGuy added a commit: rGd52e2839f3b1: [ARM][CodeGen] Add support for complex deinterleaving.

In D114174#3924650, @dmgreen wrote:

OK great. Thanks for the updates. I think this looks to be in a good state now. We can always do more nitpicking, but it looks to be in a good state to get into tree. We can get this in now and add AArch64 support on top to increase the test coverage.

Some minor comments below but otherwise LGTM.

Agreed, and thanks for working on this.

Question about AArch64, is that something you are going to work on next?

In D114174#3924782, @SjoerdMeijer wrote:

In D114174#3924650, @dmgreen wrote:

OK great. Thanks for the updates. I think this looks to be in a good state now. We can always do more nitpicking, but it looks to be in a good state to get into tree. We can get this in now and add AArch64 support on top to increase the test coverage.

Some minor comments below but otherwise LGTM.

Agreed, and thanks for working on this.

Question about AArch64, is that something you are going to work on next?

Yep, over at D129066. It's currently slightly behind this, but is mostly there already.

NickGuy mentioned this in D142012: [ReleaseNotes] Adds mention of complex number support for ARM and AArch64 backends..Jan 18 2023, 6:11 AM

Revision Contents

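Path	Size
llvm/include/llvm/CodeGen/ComplexDeinterleavingPass.h	40 lines
llvm/include/llvm/CodeGen/Passes.h	4 lines
llvm/include/llvm/CodeGen/TargetLowering.h	22 lines
llvm/include/llvm/InitializePasses.h	31 lines
llvm/lib/CodeGen/CMakeLists.txt	1 line
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp	1036 lines
llvm/lib/Target/ARM/ARMISelLowering.h	9 lines
llvm/lib/Target/ARM/ARMISelLowering.cpp	74 lines
llvm/lib/Target/ARM/ARMTargetMachine.cpp	9 lines
llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-add.ll	342 lines
llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-mul.ll	387 lines
llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll	192 lines
llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-mul.ll	222 lines
llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f64-add.ll	199 lines
llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f64-mul.ll	379 lines
llvm/test/CodeGen/ARM/O3-pipeline.ll	1 line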

Diff 438664

llvm/include/llvm/CodeGen/ComplexDeinterleavingPass.h

This file was added.

				//===- ComplexDeinterleavingPass.h - Complex Deinterleaving Pass -*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This pass implements generation of target-specific intrinsics to support
				// handling of complex number arithmetic and deinterleaving.
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_CODEGEN_COMPLEXDEINTERLEAVING_H
				#define LLVM_CODEGEN_COMPLEXDEINTERLEAVING_H

				#include "llvm/IR/PassManager.h"
				#include "llvm/IR/PatternMatch.h"

				namespace llvm {

				class Function;
				class TargetMachine;

				struct ComplexDeinterleavingPass
				: public PassInfoMixin<ComplexDeinterleavingPass> {
				private:
				TargetMachine *TM;

				public:
				ComplexDeinterleavingPass(TargetMachine *TM) : TM(TM) {}

				PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
				};

				enum class ComplexDeinterleavingOperation { None, CAdd, CMulPartial };

				dmgreen: I don't think None is ever used, and from what I can see _Placeholder is a Shuffle? That might be a better name, either Shuffle or Leaf. I would also probably remove _Incomplete. They might make sense, but it is difficult to follow and very easy to get wrong.
				} // namespace llvm

				#endif // LLVM_CODEGEN_COMPLEXDEINTERLEAVING_H

llvm/include/llvm/CodeGen/Passes.h

//===-- Passes.h - Target independent code generation passes ----*- C++ -*-===//		//===-- Passes.h - Target independent code generation passes ----*- C++ -*-===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
		dmgreen: I think this line can be removed. The other comments here are /// doxygen comments.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// This file defines interfaces to access the target independent code generation		// This file defines interfaces to access the target independent code generation
// passes provided by the LLVM backend.		// passes provided by the LLVM backend.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

[61 lines hidden]	namespace llvm {
/// If AbortOnFailedISel is true, abort compilation instead of resetting.		/// If AbortOnFailedISel is true, abort compilation instead of resetting.
MachineFunctionPass *createResetMachineFunctionPass(bool EmitFallbackDiag,		MachineFunctionPass *createResetMachineFunctionPass(bool EmitFallbackDiag,
bool AbortOnFailedISel);		bool AbortOnFailedISel);

/// createCodeGenPreparePass - Transform the code to expose more pattern		/// createCodeGenPreparePass - Transform the code to expose more pattern
/// matching during instruction selection.		/// matching during instruction selection.
FunctionPass *createCodeGenPreparePass();		FunctionPass *createCodeGenPreparePass();

		/// This pass implements generation of target-specific intrinsics to support
		/// handling of complex number arithmetic
		FunctionPass *createComplexDeinterleavingPass(const TargetMachine *TM);

/// AtomicExpandID -- Lowers atomic operations in terms of either cmpxchg		/// AtomicExpandID -- Lowers atomic operations in terms of either cmpxchg
/// load-linked/store-conditional loops.		/// load-linked/store-conditional loops.
extern char &AtomicExpandID;		extern char &AtomicExpandID;

/// MachineLoopInfo - This pass is a loop analysis pass.		/// MachineLoopInfo - This pass is a loop analysis pass.
extern char &MachineLoopInfoID;		extern char &MachineLoopInfoID;

/// MachineDominators - This pass is a machine dominators analysis pass.		/// MachineDominators - This pass is a machine dominators analysis pass.
[478 lines hidden]

llvm/include/llvm/CodeGen/TargetLowering.h

[14 lines hidden]
/// 3. Cost thresholds for alternative implementations of certain operations.		/// 3. Cost thresholds for alternative implementations of certain operations.
///		///
/// In addition it has a few other components, like information about FP		/// In addition it has a few other components, like information about FP
/// immediates.		/// immediates.
///		///
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef LLVM_CODEGEN_TARGETLOWERING_H		#ifndef LLVM_CODEGEN_TARGETLOWERING_H
#define LLVM_CODEGEN_TARGETLOWERING_H		#define LLVM_CODEGEN_TARGETLOWERING_H
		dmgreen: This isn't needed if the target adds the pass itself. There will probably need to be a way to specify _which_ patterns the target supports for a given type. For example MVE has both integer and floating point complex operations. If the subtarget has only MVE (not MVE.fp), then it needs to support the integer complex operations, without supporting floating point. Other differences could exist like one architecture supporting a different subset of operations.

		#include "ComplexDeinterleavingPass.h"
#include "llvm/ADT/APInt.h"		#include "llvm/ADT/APInt.h"
#include "llvm/ADT/ArrayRef.h"		#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"		#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLArrayExtras.h"		#include "llvm/ADT/STLArrayExtras.h"
		dmgreen: Is GeneratedIntrinsicCount just for statistics? If so it doesn't seem to be worth complicating the interface for.
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
#include "llvm/CodeGen/DAGCombine.h"		#include "llvm/CodeGen/DAGCombine.h"
#include "llvm/CodeGen/ISDOpcodes.h"		#include "llvm/CodeGen/ISDOpcodes.h"
#include "llvm/CodeGen/LowLevelType.h"		#include "llvm/CodeGen/LowLevelType.h"
#include "llvm/CodeGen/RuntimeLibcalls.h"		#include "llvm/CodeGen/RuntimeLibcalls.h"
#include "llvm/CodeGen/SelectionDAG.h"		#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"		#include "llvm/CodeGen/SelectionDAGNodes.h"
[3,007 lines hidden]	public:
virtual bool shouldKeepZExtForFP16Conv() const { return false; }		virtual bool shouldKeepZExtForFP16Conv() const { return false; }

/// Should we generate fp_to_si_sat and fp_to_ui_sat from type FPVT to type VT		/// Should we generate fp_to_si_sat and fp_to_ui_sat from type FPVT to type VT
/// from min(max(fptoi)) saturation patterns.		/// from min(max(fptoi)) saturation patterns.
virtual bool shouldConvertFpToSat(unsigned Op, EVT FPVT, EVT VT) const {		virtual bool shouldConvertFpToSat(unsigned Op, EVT FPVT, EVT VT) const {
return isOperationLegalOrCustom(Op, VT);		return isOperationLegalOrCustom(Op, VT);
}		}

		/// Does this target support complex deinterleaving
		virtual bool isComplexDeinterleavingSupported() const { return false; }

		/// Does this target support complex deinterleaving with the given operation
		/// and type
		virtual bool isComplexDeinterleavingOperationSupported(
		ComplexDeinterleavingOperation Operation, Type *Ty) const {
		return false;
		}

		/// Create the IR node for the given complex deinterleaving operation.
		/// If one cannot be created using all the given inputs, nullptr should be
		/// returned.
		virtual Value *
		createComplexDeinterleavingIR(Instruction *I,
		ComplexDeinterleavingOperation OperationType,
		unsigned Rotation, Value *InputA, Value *InputB,
		Value *Accumulator = nullptr) const {
		return nullptr;
		}

//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
// Runtime Library hooks		// Runtime Library hooks
//		//

/// Rename the default libcall routine name for the specified libcall.		/// Rename the default libcall routine name for the specified libcall.
void setLibcallName(ArrayRef<RTLIB::Libcall> Calls, const char *Name) {		void setLibcallName(ArrayRef<RTLIB::Libcall> Calls, const char *Name) {
for (auto Call : Calls)		for (auto Call : Calls)
LibcallRoutineNames[Call] = Name;		LibcallRoutineNames[Call] = Name;
[1,911 lines hidden]

llvm/include/llvm/InitializePasses.h

	[95 lines hidden]
	void initializeCFGPrinterLegacyPassPass(PassRegistry&);			void initializeCFGPrinterLegacyPassPass(PassRegistry&);
	void initializeCFGSimplifyPassPass(PassRegistry&);			void initializeCFGSimplifyPassPass(PassRegistry&);
	void initializeCFGuardPass(PassRegistry&);			void initializeCFGuardPass(PassRegistry&);
	void initializeCFGuardLongjmpPass(PassRegistry&);			void initializeCFGuardLongjmpPass(PassRegistry&);
	void initializeCFGViewerLegacyPassPass(PassRegistry&);			void initializeCFGViewerLegacyPassPass(PassRegistry&);
	void initializeCFIFixupPass(PassRegistry&);			void initializeCFIFixupPass(PassRegistry&);
	void initializeCFIInstrInserterPass(PassRegistry&);			void initializeCFIInstrInserterPass(PassRegistry&);
	void initializeCFLAndersAAWrapperPassPass(PassRegistry&);			void initializeCFLAndersAAWrapperPassPass(PassRegistry&);
	void initializeCFLSteensAAWrapperPassPass(PassRegistry&);			void initializeCFLSteensAAWrapperPassPass(PassRegistry &);
	void initializeCGProfileLegacyPassPass(PassRegistry &);			void initializeCGProfileLegacyPassPass(PassRegistry &);
	void initializeCallGraphDOTPrinterPass(PassRegistry&);			void initializeCallGraphDOTPrinterPass(PassRegistry &);
	void initializeCallGraphPrinterLegacyPassPass(PassRegistry&);			void initializeCallGraphPrinterLegacyPassPass(PassRegistry &);
	void initializeCallGraphViewerPass(PassRegistry&);			void initializeCallGraphViewerPass(PassRegistry &);
	void initializeCallGraphWrapperPassPass(PassRegistry&);			void initializeCallGraphWrapperPassPass(PassRegistry &);
	void initializeCallSiteSplittingLegacyPassPass(PassRegistry&);			void initializeCallSiteSplittingLegacyPassPass(PassRegistry &);
				dmgreen: You can undo these
	void initializeCalledValuePropagationLegacyPassPass(PassRegistry &);			void initializeCalledValuePropagationLegacyPassPass(PassRegistry &);
	void initializeCheckDebugMachineModulePass(PassRegistry &);			void initializeCheckDebugMachineModulePass(PassRegistry &);
	void initializeCodeGenPreparePass(PassRegistry&);			void initializeCodeGenPreparePass(PassRegistry &);
				void initializeComplexDeinterleavingLegacyPassPass(PassRegistry &);
	void initializeConstantHoistingLegacyPassPass(PassRegistry&);			void initializeConstantHoistingLegacyPassPass(PassRegistry &);
	void initializeConstantMergeLegacyPassPass(PassRegistry&);			void initializeConstantMergeLegacyPassPass(PassRegistry &);
	void initializeConstraintEliminationPass(PassRegistry &);			void initializeConstraintEliminationPass(PassRegistry &);
	void initializeControlHeightReductionLegacyPassPass(PassRegistry&);			void initializeControlHeightReductionLegacyPassPass(PassRegistry &);
				dmgreen: You can undo these.
	void initializeCorrelatedValuePropagationPass(PassRegistry&);			void initializeCorrelatedValuePropagationPass(PassRegistry &);
	void initializeCostModelAnalysisPass(PassRegistry&);			void initializeCostModelAnalysisPass(PassRegistry &);
	void initializeCrossDSOCFIPass(PassRegistry&);			void initializeCrossDSOCFIPass(PassRegistry &);
				dmgreen: There is a newline here?
	void initializeCycleInfoWrapperPassPass(PassRegistry &);			void initializeCycleInfoWrapperPassPass(PassRegistry &);
	void initializeDAEPass(PassRegistry&);			void initializeDAEPass(PassRegistry &);
	void initializeDAHPass(PassRegistry&);			void initializeDAHPass(PassRegistry &);
	void initializeDCELegacyPassPass(PassRegistry&);			void initializeDCELegacyPassPass(PassRegistry&);
	void initializeDFAJumpThreadingLegacyPassPass(PassRegistry &);			void initializeDFAJumpThreadingLegacyPassPass(PassRegistry &);
	void initializeDSELegacyPassPass(PassRegistry&);			void initializeDSELegacyPassPass(PassRegistry&);
	void initializeDataFlowSanitizerLegacyPassPass(PassRegistry &);			void initializeDataFlowSanitizerLegacyPassPass(PassRegistry &);
	void initializeDeadMachineInstructionElimPass(PassRegistry&);			void initializeDeadMachineInstructionElimPass(PassRegistry&);
	void initializeDebugifyMachineModulePass(PassRegistry &);			void initializeDebugifyMachineModulePass(PassRegistry &);
	void initializeDelinearizationPass(PassRegistry&);			void initializeDelinearizationPass(PassRegistry&);
	void initializeDemandedBitsWrapperPassPass(PassRegistry&);			void initializeDemandedBitsWrapperPassPass(PassRegistry&);
	[329 lines hidden]

llvm/lib/CodeGen/CMakeLists.txt

[40 lines hidden]	add_llvm_component_library(LLVMCodeGen
CFGuardLongjmp.cpp		CFGuardLongjmp.cpp
CFIFixup.cpp		CFIFixup.cpp
CFIInstrInserter.cpp		CFIInstrInserter.cpp
CodeGen.cpp		CodeGen.cpp
CodeGenCommonISel.cpp		CodeGenCommonISel.cpp
CodeGenPassBuilder.cpp		CodeGenPassBuilder.cpp
CodeGenPrepare.cpp		CodeGenPrepare.cpp
CommandFlags.cpp		CommandFlags.cpp
		ComplexDeinterleavingPass.cpp
		dmgreen: Formatting is off here
		dmgreen: This formatting again?
CriticalAntiDepBreaker.cpp		CriticalAntiDepBreaker.cpp
DeadMachineInstructionElim.cpp		DeadMachineInstructionElim.cpp
DetectDeadLanes.cpp		DetectDeadLanes.cpp
DFAPacketizer.cpp		DFAPacketizer.cpp
DwarfEHPrepare.cpp		DwarfEHPrepare.cpp
EarlyIfConversion.cpp		EarlyIfConversion.cpp
EdgeBundles.cpp		EdgeBundles.cpp
EHContGuardCatchret.cpp		EHContGuardCatchret.cpp
[211 lines hidden]

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp

This file was added.

				//===- ComplexDeinterleavingPass.cpp
				//------------------------------------------===//
				dmgreen: This looks like it got formatted incorrectly.
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				//
				//===----------------------------------------------------------------------===//

				#include "llvm/CodeGen/ComplexDeinterleavingPass.h"
				#include "llvm/ADT/Statistic.h"
				#include "llvm/Analysis/TargetLibraryInfo.h"
				#include "llvm/Analysis/TargetTransformInfo.h"
				#include "llvm/CodeGen/TargetLowering.h"
				SjoerdMeijer: Nit: perhaps just omit `in parallel`? Or if you want to keep it, it could benefit from a clarification.
				#include "llvm/CodeGen/TargetPassConfig.h"
				#include "llvm/CodeGen/TargetSubtargetInfo.h"
				#include "llvm/IR/IRBuilder.h"
				dmgreen: Can this comment be updated now?
				#include "llvm/InitializePasses.h"
				#include "llvm/Target/TargetMachine.h"
				#include "llvm/Transforms/Scalar.h"
				#include "llvm/Transforms/Utils/Local.h"

				using namespace llvm;
				SjoerdMeijer: I wouldn't mind a few more words/sentences about the internal datastructures, the graph, nodes, etc.
				using namespace PatternMatch;

				#define DEBUG_TYPE "complex-deinterleaving"

				STATISTIC(NumComplexIntrinsics, "Number of complex intrinsics generated");

				static cl::opt<bool> ComplexArithmeticEnabled(
				"enable-complex-arithmetic",
				cl::desc("Enable generation of complex arithmetic instructions"),
				cl::init(true), cl::Hidden);

				static bool isInterleavingMask(ArrayRef<int> Mask);
				static bool isDeinterleavingMask(ArrayRef<int> Mask);

				namespace {

				/**
				* Creates an integer array of length \p len, where each item is \p step more
				* than the previous. An offset can be provided to specify the first element.
				*/
				static SmallVector<int> createArrayWithStep(int len, int step, int offset = 0) {
				SmallVector<int> Arr(len);
				for (int j = 0; j < len; j++)
				Arr[j] = (j * step) + offset;
				return Arr;
				SjoerdMeijer: Nit: remove `deinterleaving` from this description?
				}

				/**
				* Creates a deinterleaving mask of the given length at the given offset.
				* A deinterleaving mask looks like <0, 2, 4, 6> or <1, 3, 5, 7>
				*/
				static SmallVector<int> createDeinterleavingMask(int len, int offset = 0) {
				return createArrayWithStep(len, 2, offset);
				}
				SjoerdMeijer: Nit: newline

				class ComplexDeinterleavingLegacyPass : public FunctionPass {
				public:
				static char ID;

				ComplexDeinterleavingLegacyPass(const TargetMachine *TM = nullptr)
				: FunctionPass(ID), TM(TM) {
				initializeComplexDeinterleavingLegacyPassPass(
				*PassRegistry::getPassRegistry());
				}

				StringRef getPassName() const override { return "Complex Arithmetic Pass"; }

				bool runOnFunction(Function &F) override;
				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.addRequired<TargetLibraryInfoWrapperPass>();
				AU.setPreservesCFG();
				}

				private:
				const TargetMachine *TM;
				};

				enum OperatingComponent { Real, Imaginary, Unknown };

				class ComplexDeinterleavingGraph;
				struct ComplexDeinterleavingCompositeNode {

				private:
				ComplexDeinterleavingCompositeNode(ComplexDeinterleavingOperation Op)
				: Operation(Op) {}

				friend class ComplexDeinterleavingGraph;

				public:
				SmallVector<Value *> getOperands() {
				SmallVector<Value *> Ops;

				for (const auto &item : ContainedInstructions) {
				dmgreen: item -> Item
				for (unsigned i = 0; i < item->getNumOperands(); i++) {
				dmgreen: i -> I (or a better name, if I is already used)
				auto *V = item->getOperand(i);
				auto *I = dyn_cast<Instruction>(V);
				if (!I || !contains(I)) {
				Ops.push_back(V);
				continue;
				}
				dmgreen: I think it would be best to store the root Real and Imag values as part of the ComplexDeinterleavingCompositeNode. That way we can be sure we are matching a correct node from anything like getContainingComposite. There can then be the TransientInstructions (maybe "ExtraInstructions"), which are also part of the pattern, but not the root nodes.
				}
				}
				return Ops;
				}
				SjoerdMeijer: Can you say what these instructions could be?

				Value *getOperand(unsigned Idx) { return getOperands()[Idx]; }

				unsigned getNumOperands() { return getOperands().size(); }

				dmgreen: Could AccumulatorNode just be another Operand?
				SmallVector<Instruction *> ContainedInstructions;
				dmgreen: I think this would be best to avoid NodePtr, if that might keep a reference to a node alive. Avoiding shared_ptr entirely would be good, but I imagine any alternative might be difficult too. They need to be stable values to avoid memory invalidation issues.
				Value *OutputNode = nullptr;
				Value *OriginalInput0 = nullptr;
				Value *OriginalInput1 = nullptr;
				Value *ReplacementNode = nullptr;
				bool IsTopLevel = false;
				ComplexDeinterleavingOperation Operation;

				bool UsesNegation = false;
				unsigned Rotation = 0;
				Value *Input0 = nullptr;
				Value *Input1 = nullptr;
				Value *Accumulator = nullptr;
				dmgreen: Can we just use nullptr as the sentinel value for IsValid = false, removing the need for this struct? If we need some way to represent the last shuffle node then we could add a Shuffle or Leaf node type (which could itself have a ReplacementNode just set to the shuffle->operand(0)). Hopefully that also simplifies resolveInputs / Elevate. Especially if Accumulator is treated in the same way as an Operand. Hopefully that is simpler.
				Value *Accumulatee = nullptr;

				void addInstruction(Instruction *I) { ContainedInstructions.push_back(I); }
				bool contains(Instruction *I) {
				if (I == ReplacementNode)
				return true;

				return std::find(ContainedInstructions.begin(), ContainedInstructions.end(),
				chill: Here and elsewhere `std::find` -> `llvm::find`
				I) != ContainedInstructions.end();
				}
				};
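
The `Rotation` field above is stored as an `unsigned`, and huntergr's review comment on this patch notes that it only ever takes the values 0, 90, 180 and 270. A minimal sketch of the enum alternative he raises follows. The names `ComplexRotation` and `rotationFromDegrees` are hypothetical and do not appear in the patch.

```cpp
// Hypothetical enum; the enumerator values keep the degree encoding used by
// the unsigned Rotation field.
enum class ComplexRotation : unsigned {
  Rot0 = 0,
  Rot90 = 90,
  Rot180 = 180,
  Rot270 = 270
};

// Converts a degree count to the enum. Returns false, leaving Out untouched,
// for anything other than 0, 90, 180 or 270.
inline bool rotationFromDegrees(unsigned Degrees, ComplexRotation &Out) {
  switch (Degrees) {
  case 0:
  case 90:
  case 180:
  case 270:
    Out = static_cast<ComplexRotation>(Degrees);
    return true;
  default:
    return false;
  }
}
```

Keeping the degree values as the underlying encoding would let existing equality checks against 0/90/180/270 be migrated one at a time.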

				class ComplexDeinterleavingGraph {
				private:
				using NodePtr = std::shared_ptr<ComplexDeinterleavingCompositeNode>;

				SmallVector<Instruction *> Instructions;
				SmallVector<NodePtr> CompositeNodes;

				llvm::TargetTransformInfo::TargetCostKind CostKind =
				llvm::TargetTransformInfo::TCK_Latency;
				dmgreen: Reciprocal Throughput is more common.

				InstructionCost CostOfIntrinsics;

				/// Determines the operating component of the given Value.
				/// This is achieved by looking at the operating component of the Value's
				/// operands and, based on the instruction, evaluates what the resulting
				/// component would be.
				OperatingComponent getOperatingComponentOfValue(Value *V) {
				Instruction *I = dyn_cast_or_null<Instruction>(V);
				if (!I)
				return Unknown;

				if (auto *Shuffle = dyn_cast<ShuffleVectorInst>(I)) {
				auto ShuffleMask = Shuffle->getShuffleMask();
				dmgreen: ConvergingI is the Root of the graph? Perhaps just call it RootValue.
				if (isDeinterleavingMask(ShuffleMask)) {
				if (ShuffleMask[0] == 0)
				return Real;
				if (ShuffleMask[0] == 1)
				return Imaginary;
				}
				dmgreen: Why is this needed? When are the Real and Imag values not already correct? It seems strange to be able to reverse them.
				NickGuy (author): It was a holdover function that I reused for the `.Valid` check. It's not used anywhere else so I've removed it (and the relevant struct), and moved the relevant check to here.
				return Unknown;
				}
				dmgreen: Formatting needs updating.

				if (I->getOpcode() == Instruction::FMul) {
				auto Op0Component = getOperatingComponentOfValue(I->getOperand(0));
				SjoerdMeijer: I haven't looked carefully, but why do we also need to record all instructions here? I mean, the graph can be queried for existence of nodes/instructions, or is this bookkeeping duplication more efficient?
				NickGuy (author): The duplication makes it simpler to check the internal uses of a node (see `ComplexDeinterleavingCompositeNode::hasAllInternalUses`). Without it, we'd need to iterate over the nodes twice, and check the instructions of each node (all of `Real`, `Imag`, and the contents of `InternalInstructions`), while this approach allows us to simply say "Is this instruction known within the graph?".
				auto Op1Component = getOperatingComponentOfValue(I->getOperand(1));
				if (Op0Component == Unknown || Op1Component == Unknown)
				return Unknown;
				if (Op0Component == Op1Component)
				return Real;
				return Imaginary;
				}

				SjoerdMeijer: Nit: no validation going on here, only insertion?
				NickGuy (author): Good catch, the validation that was here was moved to be more tied into the identification. Renamed.
				if (I->getOpcode() == Instruction::FNeg)
				return getOperatingComponentOfValue(I->getOperand(0));

				if (I->getOpcode() == Instruction::FAdd ||
				I->getOpcode() == Instruction::FSub) {
				auto Op0Component = getOperatingComponentOfValue(I->getOperand(0));
				auto Op1Component = getOperatingComponentOfValue(I->getOperand(1));
				if (Op0Component != Op1Component || Op1Component == Unknown)
				return Unknown;
				return Op0Component;
				}

				return Unknown;
				}
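
The component rules in `getOperatingComponentOfValue` can be read as a small algebra: a multiply of two like components yields a real term, a multiply of unlike components yields an imaginary term, and an add or subtract is only meaningful when both sides agree. The following standalone sketch mirrors those rules on a toy expression tree. `Expr`, `Op`, and `componentOf` are hypothetical stand-ins for the LLVM `Value`/`Instruction` types, and the leaf kinds stand in for the real and imaginary deinterleaving shuffles.

```cpp
enum class Component { Unknown, Real, Imaginary };
enum class Op { LeafReal, LeafImag, FMul, FAdd, FSub, FNeg };

// Toy expression node (hypothetical); leaves model deinterleaving shuffles.
struct Expr {
  Op Kind;
  const Expr *LHS;
  const Expr *RHS;
  explicit Expr(Op K, const Expr *L = nullptr, const Expr *R = nullptr)
      : Kind(K), LHS(L), RHS(R) {}
};

Component componentOf(const Expr *E) {
  if (!E)
    return Component::Unknown;
  switch (E->Kind) {
  case Op::LeafReal:
    return Component::Real;
  case Op::LeafImag:
    return Component::Imaginary;
  case Op::FNeg:
    // Negation does not change which component a value belongs to.
    return componentOf(E->LHS);
  case Op::FMul: {
    Component L = componentOf(E->LHS), R = componentOf(E->RHS);
    if (L == Component::Unknown || R == Component::Unknown)
      return Component::Unknown;
    // real*real and imag*imag feed the real part; mixed products feed the
    // imaginary part.
    return L == R ? Component::Real : Component::Imaginary;
  }
  case Op::FAdd:
  case Op::FSub: {
    Component L = componentOf(E->LHS), R = componentOf(E->RHS);
    // Both operands must agree, and must be known.
    if (L != R || R == Component::Unknown)
      return Component::Unknown;
    return L;
  }
  }
  return Component::Unknown;
}
```

For a complex multiply `(ar + i*ai) * (br + i*bi)`, this classifies `ar*br - ai*bi` as real and `ar*bi + ai*br` as imaginary, while a sum that mixes the two kinds of product comes back as unknown.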

				void addInstruction(Instruction *I) { Instructions.push_back(I); }

				void sortCompositeNodes(BasicBlock *B) {
				SmallVector<NodePtr> NewNodeList;

				// Sort the nodelist based on the instruction order
				for (auto &I : *B) {
				if (auto CN = findNodeFromOutput(&I))
				NewNodeList.push_back(CN);
				}

				for (unsigned i = 0; i < NewNodeList.size(); i++)
				CompositeNodes[i] = NewNodeList[i];
				}

				NodePtr findNodeFromOutput(Instruction *I) {
				for (const auto &Item : CompositeNodes) {
				if (Item->OutputNode == I)
				return Item;
				}

				return nullptr;
				}
				dmgreen: If NodeInstructions is used to calculate all the Instructions we have seen, to check that all the uses are inside the graph, it could be a SmallPtrSet which will have quicker contains().
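
The set-based membership test suggested in this comment could look roughly like the sketch below. It uses `std::unordered_set` as a stand-in for `llvm::SmallPtrSet` so that it compiles without LLVM headers. `Instr` and `SeenInstructions` are hypothetical names, not part of the patch.

```cpp
#include <unordered_set>

// Hypothetical stand-in for llvm::Instruction.
struct Instr {
  int Id;
};

// Tracks every instruction seen while building the graph. Membership is a
// hash lookup rather than a linear scan over a vector.
class SeenInstructions {
  std::unordered_set<const Instr *> Seen;

public:
  void add(const Instr *I) { Seen.insert(I); }
  bool contains(const Instr *I) const { return Seen.count(I) != 0; }
};
```

With a vector, each `containsNode`-style query walks the whole list; with a pointer set the same query is expected constant time, which matters when it is called once per use of every instruction.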

				SmallVector<Instruction *> findUnmatchedInstructions() {
				SmallVector<Instruction *> Is;
				for (auto &I : Instructions) {
				if (shouldIgnoreValue(I))
				continue;
				if (getContainingComposite(I) == nullptr)
				Is.push_back(I);
				}
				return Is;
				}

				Value *getSharedOperand(Instruction *A, Instruction *B, unsigned &Idx) {
				if (A->getNumOperands() != B->getNumOperands())
				return nullptr;

				for (unsigned i = 0; i < A->getNumOperands(); i++) {
				auto *Op = A->getOperand(i);
				if (Op == B->getOperand(i)) {
				Idx = i;
				SjoerdMeijer: Nit: newline?
				return Op;
				}
				}
				return nullptr;
				}

				bool haveSharedUses(Value *A, Value *B) {
				if (A->hasOneUser() && B->hasOneUser()) {
				auto *AUser = *A->user_begin();
				auto *BUser = *B->user_begin();

				if (AUser && AUser == BUser)
				return true;

				auto AUCN = getContainingComposite(dyn_cast<Instruction>(AUser));
				auto BUCN = getContainingComposite(dyn_cast<Instruction>(BUser));

				if (AUCN && AUCN == BUCN)
				return true;
				}

				if (A->getNumUses() != B->getNumUses())
				return false;

				for (const auto &AUser : A->users()) {
				bool Found = false;
				auto AUCN = getContainingComposite(dyn_cast<Instruction>(AUser));
				for (const auto &BUser : B->users()) {
				if (AUser == BUser) {
				Found = true;
				break;
				}
				auto BUCN = getContainingComposite(dyn_cast<Instruction>(BUser));
				if (AUCN && AUCN == BUCN) {
				Found = true;
				break;
				}
				}
				if (!Found) {
				LLVM_DEBUG(dbgs() << "AUser doesn't have a match: "; AUser->dump());
				return false;
				}
				}

				return true;
				}

				Value *followUseChain(Value *V) {
				dmgreen: It is hard to see how wrapping a Value as an _Incomplete node is always correct. It will have lost the information about whether V was real or imaginary.
				if (V->hasOneUser())
				return followUseChain(*V->user_begin());

				// TODO handle multiple users, but how?

				return V;
				}

				Value *getFinalInputReplacement(Instruction *I) {
				for (unsigned OpIdx = I->getNumOperands() - 1; OpIdx >= 0; OpIdx--) {
				auto *Op = dyn_cast<Instruction>(I->getOperand(OpIdx));
				while (Op && shouldIgnoreValue(Op))
				Op = dyn_cast<Instruction>(Op->getOperand(0));
				if (Op == nullptr)
				continue;

				auto CN = getContainingComposite(Op);
				if (CN == nullptr || CN->ReplacementNode == nullptr)
				continue;
				return followUseChain(CN->ReplacementNode);
				}

				return nullptr;
				}

				SjoerdMeijer: Nit: can omit the comparison with 1.
				Value *getReplacement(Instruction *I) {
				if (!I)
				return nullptr;
				auto CN = getContainingComposite(I);
				if (CN == nullptr || CN->ReplacementNode == nullptr)
				return I;
				return CN->ReplacementNode;
				}

				std::shared_ptr<ComplexDeinterleavingCompositeNode>
				prepareCompositeNode(ComplexDeinterleavingOperation Operation) {
				return std::shared_ptr<ComplexDeinterleavingCompositeNode>(
				chill: `return std::make_shared<ComplexDeinterleavingCompositeNode>(Operation);`
				new ComplexDeinterleavingCompositeNode(Operation));
				}

				void
				submitCompositeNode(std::shared_ptr<ComplexDeinterleavingCompositeNode> CN) {
				CompositeNodes.push_back(CN);
				}

				bool containsNode(Instruction *I) {
				return std::find(Instructions.begin(), Instructions.end(), I) !=
				Instructions.end();
				}

				/// Certain values, such as extends and truncates, should be ignored within
				/// the graph for our needs as they contribute towards structure rather than
				/// function.
				///
				/// e.g. A deinterleaving shuffle provides no functionality itself,
				/// and does not need to be explicitly handled beyond the usual operations. A
				SjoerdMeijer: Nit: I am a fan of reducing indentation, e.g.: `auto *SVI = dyn_cast<ShuffleVectorInst>(&I); if (!SVI) continue; if (!isInterleavingMask(SVI->getShuffleMask())) continue;` etc.
				/// shuffle that is neither interleaving nor deinterleaving is an example of
				/// one that needs to be handled, and thus should not be ignored.
				bool shouldIgnoreValue(Value *V) {
				if (isa<InsertElementInst>(V))
				return true;

				if (auto *SVI = dyn_cast<ShuffleVectorInst>(V)) {
				auto Mask = SVI->getShuffleMask();
				return isInterleavingMask(Mask) || isDeinterleavingMask(Mask);
				}

				if (auto *I = dyn_cast<Instruction>(V)) {
				auto Opc = I->getOpcode();
				return I->isCast() || Opc == Instruction::FPTrunc ||
				Opc == Instruction::FPExt;
				}
				return false;
				}

				/// Checks the users of the given instructions to evaluate whether the
				/// returns from said instructions converge at any point. e.g. in a shuffle
				bool doInstructionsConverge(Instruction *A, Instruction *B) {
				if (A->hasOneUser() && B->hasOneUser()) {
				auto *AUser = *A->user_begin();
				SjoerdMeijer: Nit: maybe I overlooked something, but this doesn't seem to use any state from the class so could be just a helper function?
				NickGuy (author): It populates the compositeNode structure through (formerly) `validateAndSubmitCompositeNode`, and also calls `identifyNode` further, which does use state from the class.
				auto *BUser = *B->user_begin();

				while (shouldIgnoreValue(AUser))
				AUser = *AUser->user_begin();
				while (shouldIgnoreValue(BUser))
				BUser = *BUser->user_begin();

				if (AUser == BUser)
				return true;
				}

				return haveSharedUses(A, B);
				}

				NodePtr getContainingComposite(Instruction *I) {
				if (I == nullptr)
				return nullptr;
				for (const auto &CN : CompositeNodes) {
				if (CN->contains(I))
				return CN;
				if (CN->ReplacementNode == I)
				return CN;
				}
				return nullptr;
				}

				public:
				/// Step through the use-def chains to find all instruction nodes converging
				/// on \p I.
				void discoverNodes(BasicBlock *B, Instruction *I);
				/// Iterate over the nodes, reducing them to complex nodes where possible.
				/// Returns false if the deinterleaving operation should be cancelled for the
				/// current graph.
				bool reduceNodes(const TargetLowering *TL);
				/// Perform the actual replacement of the underlying instruction graph.
				/// Returns false if the deinterleaving operation should be cancelled for the
				/// current graph.
				bool replaceNodes(const TargetLowering *TL);
				void getDeadRoots(SmallVector<Instruction *> &DeadInstRoots);
				};

				class ComplexDeinterleaving {
				public:
				ComplexDeinterleaving(const TargetLowering *tl, const TargetLibraryInfo *tli)
				: TL(tl), TLI(tli) {}
				dmgreen: If you can turn some of these into better sentences that would be good. Perhaps "Mul instruction has multiple uses" like you have below.
				bool runOnFunction(Function &F);

				private:
				bool evaluateComplexDeinterleavingBasicBlock(BasicBlock *B);

				const TargetLowering *TL = nullptr;
				const TargetLibraryInfo *TLI = nullptr;
				};

				} // namespace

				char ComplexDeinterleavingLegacyPass::ID = 0;

				INITIALIZE_PASS_BEGIN(ComplexDeinterleavingLegacyPass, DEBUG_TYPE,
				"Complex Deinterleaving", false, false)
				INITIALIZE_PASS_END(ComplexDeinterleavingLegacyPass, DEBUG_TYPE,
				"Complex Deinterleaving", false, false)

				PreservedAnalyses ComplexDeinterleavingPass::run(Function &F,
				dmgreen: Similar to identifyPartialMul, this probably needs to be more careful about what it is selecting as which operands.
				FunctionAnalysisManager &AM) {
				const TargetLowering *TL = TM->getSubtargetImpl(F)->getTargetLowering();
				auto &TLI = AM.getResult<llvm::TargetLibraryAnalysis>(F);
				if (!ComplexDeinterleaving(TL, &TLI).runOnFunction(F))
				return PreservedAnalyses::all();

				dmgreen: Should this function be calling identifyNode on the inputs, similar to identifyPartialMul?
				NickGuy (author): Not here, no. The inputs need to be resolved as part of the whole pair, so the relevant one for the created node is passed back via CommonOperandI.
				dmgreen: What about the other pair? They are not guaranteed to be the same as the operands from identifyPartialMul as far as I can see. I would probably make sure that the two operands are a valid match too. The second call to identifyNode with the same operands should just find the same node again. And that way this can have proper operands. Currently this and identifyPartialMul are very much interlinked. I hope we can improve that in the future, but it sounds like an extension that can be thought about later.
				PreservedAnalyses PA;
				PA.preserve<FunctionAnalysisManagerModuleProxy>();
				return PA;
				}

				FunctionPass *llvm::createComplexDeinterleavingPass(const TargetMachine *TM) {
				return new ComplexDeinterleavingLegacyPass(TM);
				}

				bool ComplexDeinterleavingLegacyPass::runOnFunction(Function &F) {
				const auto *TL = TM->getSubtargetImpl(F)->getTargetLowering();
				auto TLI = getAnalysis<TargetLibraryInfoWrapperPass>().getTLI(F);
				return ComplexDeinterleaving(TL, &TLI).runOnFunction(F);
				}

				bool ComplexDeinterleaving::runOnFunction(Function &F) {
				if (!ComplexArithmeticEnabled) {
				LLVM_DEBUG(dbgs() << "Complex has been explicitly disabled.\n");
				return false;
				}

				if (!TL->isComplexDeinterleavingSupported()) {
				LLVM_DEBUG(dbgs() << "Complex has been disabled, target does not support "
				"lowering of complex numbers.\n");
				return false;
				}

				bool Changed = false;
				for (auto &B : F)
				Changed |= evaluateComplexDeinterleavingBasicBlock(&B);
				SjoerdMeijer: Nit: I am wondering if these casts here and below are necessary.
				NickGuy (author): You are correct, good catch.

				return Changed;
				}

				/**
				dmgreen: When can this be false?
				NickGuy (author): In its current state, never. It was possible before moving to the implicitAdd approach.
				* Checks the given mask, and determines whether said mask is interleaving.
				*
				* To be interleaving, a mask must alternate between `i` and `i + (Length / 2)`,
				* and must contain all numbers within the range of `[0..Length)`
				* (e.g. a 4x vector interleaving mask would be <0, 2, 1, 3>).
				*/
				static bool isInterleavingMask(ArrayRef<int> Mask) {
				int HalfNumElements = Mask.size() / 2;

				dmgreen: Formatting is a bit long.
				for (int Idx = 0; Idx < HalfNumElements; ++Idx) {
				huntergr: I can't find any place where rotations are dealt with besides assignment and equality checks of 0,90,180,270 -- might an enum be preferable? Or do you anticipate doing arithmetic with them in a follow-up patch?
				if (Mask[(Idx * 2) + 1] != (Mask[Idx * 2] + HalfNumElements))
				SjoerdMeijer: nit: maybe some or all of these node setting things can be done in a constructor.
				NickGuy (author): Not sure I agree with that, there are cases - such as a `Shuffle` - where none of the other settings are relevant. The operation type, real instruction, and imaginary instructions are the only settings relevant to all cases.
				return false;
				}

				return true;
				}
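
As a standalone illustration of the pairing rule described in the comment above `isInterleavingMask`, here is a minimal sketch with `std::vector<int>` standing in for `llvm::ArrayRef<int>`. The function name `isInterleavingMaskSketch` is hypothetical. Like the code in the diff, it checks only that each odd slot equals the preceding even slot plus half the mask length; it does not check that every index in `[0..Length)` is present.

```cpp
#include <vector>

// Mirrors the alternation check: for each pair (Mask[2i], Mask[2i+1]) the
// second entry must equal the first plus half the mask length.
bool isInterleavingMaskSketch(const std::vector<int> &Mask) {
  const int Half = static_cast<int>(Mask.size() / 2);
  for (int Idx = 0; Idx < Half; ++Idx) {
    if (Mask[Idx * 2 + 1] != Mask[Idx * 2] + Half)
      return false;
  }
  return true;
}
```

For a four-element mask, `<0, 2, 1, 3>` passes (pairs `(0, 2)` and `(1, 3)`), while the identity mask `<0, 1, 2, 3>` is rejected at the first pair.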

				/**
				* Checks the given mask, and determines whether said mask is deinterleaving.
				*
				* To be deinterleaving, a mask must increment in steps of 2, and either start
				* with 0 or 1.
				* (e.g. an 8x vector deinterleaving mask would be either <0, 2, 4, 6> or
				* <1, 3, 5, 7>).
				*/
				static bool isDeinterleavingMask(ArrayRef<int> Mask) {
				int Offset = Mask[0];
				int HalfNumElements = Mask.size() / 2;

				for (int Idx = 1; Idx < HalfNumElements; ++Idx) {
				if (Mask[Idx] != (Idx * 2) + Offset)
				return false;
				}

				return true;
				}
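
The rule in the comment above `isDeinterleavingMask` (step of 2, starting at 0 or 1) can be sketched in standalone form as follows. This is a stricter variant than the code in the diff, under two assumptions of mine: it walks the entire mask rather than only the first `Mask.size() / 2` entries, and it rejects offsets other than 0 or 1. The name `isDeinterleavingMaskSketch` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Accepts masks of the form <Offset, Offset + 2, Offset + 4, ...> where
// Offset is 0 (real lanes) or 1 (imaginary lanes).
bool isDeinterleavingMaskSketch(const std::vector<int> &Mask) {
  if (Mask.empty())
    return false;
  const int Offset = Mask[0];
  if (Offset != 0 && Offset != 1)
    return false;
  for (std::size_t Idx = 1; Idx < Mask.size(); ++Idx) {
    if (Mask[Idx] != static_cast<int>(Idx) * 2 + Offset)
      return false;
  }
  return true;
}
```

Under these assumptions `<0, 2, 4, 6>` and `<1, 3, 5, 7>` are accepted, and a mask whose last element breaks the stride, such as `<0, 2, 4, 7>`, is rejected.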

				bool ComplexDeinterleaving::evaluateComplexDeinterleavingBasicBlock(
				dmgreen: evaluateComplexDeinterleavingBasicBlock -> evaluateBasicBlock
				BasicBlock *B) {
				bool Changed = false;

				SmallVector<Instruction *> DeadInstrRoots;

				for (auto &I : *B) {
				if (auto *SVI = dyn_cast<ShuffleVectorInst>(&I)) {
				// Look for a shufflevector that takes separate vectors of the real and
				// imaginary components and recombines them into a single vector.
				if (isInterleavingMask(SVI->getShuffleMask())) {
				ComplexDeinterleavingGraph Graph;
				Graph.discoverNodes(B, SVI);
				if (Graph.reduceNodes(TL) && Graph.replaceNodes(TL)) {
				Changed = true;
				DeadInstrRoots.push_back(SVI);
				} else {
				SmallVector<Instruction *> DeadInstrs;
				Graph.getDeadRoots(DeadInstrs);
				dmgreen: What if the operands are FMul(A, A) and FMul (B, C)? I think it needs to be more precise about which nodes are which.
				dmgreen: I still think this needs to be more precise with regard to what is considered the CommonOperand and CommonOperand.
				for (auto It = DeadInstrs.rbegin(); It != DeadInstrs.rend(); It++)
				(*It)->eraseFromParent();
				}
				}
				}
				}

				for (const auto &I : DeadInstrRoots)
				llvm::RecursivelyDeleteTriviallyDeadInstructions(I, TLI);

				return Changed;
				SjoerdMeijer: Nits: Op -> Operand, not Instruction -> not a Instruction?
				}

				void ComplexDeinterleavingGraph::discoverNodes(BasicBlock *B, Instruction *I) {

				if (I->getParent() != B)
				return;

				if (containsNode(I))
				return;

				if (isa<LoadInst>(I) || isa<PHINode>(I)) {
				// No need to discover beyond a load or a phi.
				return;
				}

				addInstruction(I);

				if (auto *SVI = dyn_cast<ShuffleVectorInst>(I)) {
				auto ShuffleMask = SVI->getShuffleMask();

				dmgreen: These can be // comments. They are very useful, but maybe it is not necessary to repeat them.
				auto RealMask = createDeinterleavingMask(ShuffleMask.size());
				chill: These could be, e.g.: `static const int RealMask[] = {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30}; auto RealMaskRef = ArrayRef<int>(RealMask, ShuffleMask.size());` with an assertion/bounds check. Good enough for 512-bit vectors with 16-bit elements, can be extended.
				auto ImagMask = createDeinterleavingMask(ShuffleMask.size(), 1);
				ArrayRef<int> RealMaskRef(RealMask);
				ArrayRef<int> ImagMaskRef(ImagMask);
				dmgreen: I don't think it makes sense to treat these backwards. We know the CR part should be real and the CI is imaginary. It would only be valid to treat them the other way around if there was some sort of shuffle added.

				Value *ShuffleSource;
				if (match(SVI, m_Shuffle(m_Value(ShuffleSource), m_Poison(),
				m_SpecificMask(RealMaskRef))) ||
				match(SVI, m_Shuffle(m_Value(ShuffleSource), m_Poison(),
				m_SpecificMask(ImagMaskRef)))) {
				// Reached "top" of graph, stop discovering.
				dmgreen: Is it valid to test the operands in the opposite order?
				NickGuy (author): No, but it's also not invalid (the checks do what they're supposed to and don't produce a node). I've removed these.
				// TODO this check needs refining
				if (ShuffleSource &&
				(isa<LoadInst>(ShuffleSource) || !isa<Instruction>(ShuffleSource)))
				return;
				}
				dmgreen: Do you have any tests for fneg? This should probably be more like the code in identifyPartialMul where the rotation is chosen based on whether the Real/Imag parts are negated.
				}

				for (const auto &Op : I->operands()) {
				if (auto *OpI = dyn_cast<Instruction>(Op))
				discoverNodes(B, OpI);
				}
				}
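
`discoverNodes` builds its real and imaginary masks through `createDeinterleavingMask`, whose definition is not shown in this part of the diff. As a rough sketch only, a helper with that shape could generate the stride-2 index sequence as below. The name `makeDeinterleavingMask`, its signature, and the use of `std::vector<int>` are assumptions for illustration, not the patch's implementation.

```cpp
#include <vector>

// Hypothetical helper: builds <Offset, Offset + 2, Offset + 4, ...> with
// NumElements entries. Offset 0 selects the real lanes of an interleaved
// vector, Offset 1 the imaginary lanes.
std::vector<int> makeDeinterleavingMask(unsigned NumElements, int Offset = 0) {
  std::vector<int> Mask;
  Mask.reserve(NumElements);
  for (unsigned Idx = 0; Idx < NumElements; ++Idx)
    Mask.push_back(static_cast<int>(Idx) * 2 + Offset);
  return Mask;
}
```

chill's inline comment proposes a static table sliced to the needed length instead; generating the mask on demand trades a small allocation for not having to cap the supported vector width.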

				bool ComplexDeinterleavingGraph::reduceNodes(const TargetLowering *TL) {
				if (Instructions.empty()) {
				LLVM_DEBUG(dbgs() << "No Instructions, cannot reduce.\n");
				return false;
				}

				dmgreen: I think I was expecting Operands to be the ComplexDeinterleavingCompositeNode* from identifyNode. That avoids the need to find them again later.
				NickGuy (author): Attempting to plumb something together that accepts both `ComplexDeinterleavingCompositeNode*` and `Value*` caused more bloat and made things more difficult to follow. And we can't simply use `Node->ReplacementNode` as the `Value` because it hasn't been assigned yet. Because of that constraint, I'm opting to find the nodes later, though I'm open to ideas.
				dmgreen: I was expecting it to just accept ComplexDeinterleavingCompositeNode. If all the operands are ComplexDeinterleavingCompositeNode, then they just need to be visited in the correct order to assign ReplacementNode based on how they need to be transformed. (Which I believe is fine providing we visit them in reverse order, like is already done). i.e. we construct a graph made up of ComplexDeinterleavingCompositeNode, then transform that graph. That seems like a simpler, more extensible design going forward. It gets more complicated if there is a mixing of values between the original IR and the newly constructed intrinsics. I could imagine that might make graphs (as opposed to DAGs) more difficult, but we don't yet support any cycles.
				NickGuy (author): I've converted the pseudo-graph (a list with lookups) to an actual graph structure, with every node being a ComplexDeinterleavingCompositeNode, meaning that we're no longer mixing `Value`s and `ComplexDeinterleavingCompositeNode`s.
				auto *ConvergingI = Instructions[0];
				dmgreen: Formatting.

				for (auto &I : *ConvergingI->getParent()) {
				chill: Why are we scanning the whole block? It looks to me (admittedly I don't understand what this pass does yet) we can just walk over the `Instructions` vector (in reverse, if the order matters), avoiding the quadratic complexity.
				NickGuy (author): The order of the `CompositeNode`s is important, this was a holdover from before we were sorting the nodes.
				dmgreen: Does this need to be calling identifyNode on the third argument too? The one derived from UncommonValues.

				if (!containsNode(&I))
				continue;

				auto *N = &I;
				if ((match(N, m_FAdd(m_FMul(m_Value(), m_Value()),
				m_FMul(m_Value(), m_Value()))) ||
				match(N, m_FSub(m_FMul(m_Value(), m_Value()),
				m_FMul(m_Value(), m_Value()))))) {

				auto *VTy = dyn_cast<FixedVectorType>(N->getType());
				if (!VTy)
				continue;

				auto *NewVTy =
				FixedVectorType::get(VTy->getScalarType(), VTy->getNumElements() * 2);

				if (!TL->isComplexDeinterleavingOperationSupported(
				ComplexDeinterleavingOperation::CMulPartial, NewVTy)) {
				dbgs() << "Type isn't valid for CMulPartial: ";
				NewVTy->dump();
				continue;
				}

				LLVM_DEBUG(dbgs() << "Composite node built up from "; N->dump());
				auto CN = prepareCompositeNode(
				llvm::ComplexDeinterleavingOperation::CMulPartial);

				auto *Op0 = cast<Instruction>(N->getOperand(0));
				auto *Op1 = cast<Instruction>(N->getOperand(1));

				dmgreen: Should these be checking the subnodes? `if (!identifyNode(..)) return nullptr;`
				CN->addInstruction(N);
				CN->addInstruction(Op0);
				CN->addInstruction(Op1);

				CN->OriginalInput0 = Op0;
				CN->OriginalInput1 = Op1;

				bool ContainsNeg = false;
				for (unsigned i = 0; i < Op0->getNumOperands(); i++) {
				chill: Here and in a few other places `for (Value *V : Op0->operands()) {`
				auto *Op = dyn_cast<Instruction>(Op0->getOperand(i));
				if (Op && Op->getOpcode() == Instruction::FNeg) {
				if (ContainsNeg)
				break;
				CN->addInstruction(Op);
				CN->OriginalInput0 = Op;
				ContainsNeg = true;
				}
				}
				for (unsigned i = 0; i < Op1->getNumOperands(); i++) {
				auto *Op = dyn_cast<Instruction>(Op1->getOperand(i));
				if (Op && Op->getOpcode() == Instruction::FNeg) {
				if (ContainsNeg)
				break;
				dmgreen: CommonOperandI only seems to be important between identifyPartialMul and identifyNodeWithImplicitAdd
				NickGuy (author): It used to be needed all through, but that is no longer the case. Changed the signatures and such to reflect this.
				CN->addInstruction(Op);
				CN->OriginalInput1 = Op;
				ContainsNeg = true;
				}
				}

				if (!ContainsNeg) {
				auto &Use = (*N->use_begin());
				if (N->getOpcode() == Instruction::FSub) {
				dmgreen: This looks like it needs to match:
				- RealI and ImagI are both shuffles.
				- They both have the same Operand0.
				- They both have "deinterleaving" masks.
				I don't think the type of the value of Operand0 otherwise matters. It doesn't matter if it is a Load or an Argument, we can always just use it.
				if (isa<ShuffleVectorInst>(Use.getUser()) &&
				Use.getOperandNo() != 0) {
				LLVM_DEBUG(dbgs()
				<< "First converging shuffle operand should be an FSub"
				<< ".\n");
				return false;
				}
				} else if (N->getOpcode() == Instruction::FAdd) {
				if (isa<ShuffleVectorInst>(Use.getUser()) &&
				Use.getOperandNo() != 1) {
				LLVM_DEBUG(dbgs()
				<< "Second converging shuffle operand should be an FAdd"
				<< ".\n");
				return false;
				}
				}
				}
				dmgreen: It's not really the combination of rotates that is invalid, although some might make less sense than others, and some might simplify to code that is difficult to match. The even rotations will define the real part of `a`, the odd rotations define the imag part. Between the two parts of the multiply we match, we need to find both the real and imag halves of the value to successfully match it further. (We may be able to do something with only half a match in some cases, like matching leaves, but that should probably be left for a later patch.)

				auto Pattern = m_BinOp(m_Shuffle(m_Value(), m_Value()),
				m_Shuffle(m_Value(), m_Value()));
				CN->IsTopLevel = match(CN->OriginalInput0, Pattern) &&
				match(CN->OriginalInput1, Pattern);
				CN->UsesNegation = ContainsNeg;
				CN->OutputNode = N;

				dmgreen: These need to check that the first is the real deinterleave with offset=0, and the imag has offset=1. And maybe that they only take elements from the first operand and don't change size.
				CN->Rotation = (N->getOpcode() == Instruction::FAdd) * 90;

				if (N->getOpcode() == Instruction::FSub) {
				auto *SubOp0 = cast<Instruction>(N->getOperand(0));
				auto SubOp0C0 = getOperatingComponentOfValue(SubOp0->getOperand(0));
				auto SubOp0C1 = getOperatingComponentOfValue(SubOp0->getOperand(1));

				if (SubOp0C0 == SubOp0C1) {
				if (SubOp0C0 == OperatingComponent::Imaginary) {
				CN->Rotation += 90;
				huntergr: I think it would be worthwhile to add a comment about what the interleaving represents -- that you're looking for a shuffle that takes separate vectors of real and imaginary parts and combines them before they are stored to memory (or returned in registers), and that this is just for matching per-lane operations instead of cross-lane (like a reduction). Or at least that's the behaviour I've observed when testing your patch with a loop like the following:
				    #define LEN (512)
				    float _Complex a[LEN];
				    float _Complex b[LEN];
				    float _Complex c[LEN];
				    void foo(void) {
				    #pragma clang loop vectorize(enable)
				      for (int i = 0; i < LEN; ++i)
				        a[i] = b[i] * c[i];
				    }
				}
				}
				}

				if (CN->UsesNegation)
				CN->Rotation += 180;

				submitCompositeNode(CN);
				}
				}

				auto Unmatched = findUnmatchedInstructions();
				SmallVector<Instruction *> Pairs;

				for (auto &I : Unmatched) {
				if (std::find(Pairs.begin(), Pairs.end(), I) != Pairs.end())
				continue;
				for (auto &J : Unmatched) {
				if (I == J || std::find(Pairs.begin(), Pairs.end(), J) != Pairs.end())
				continue;

				if (doInstructionsConverge(I, J)) {
				Pairs.push_back(I);
				Pairs.push_back(J);
				break;
				}
				dmgreen: These seem to do recursive checks into the operands, but that is already being done again in identifyNode. Can we just remove them and rely on identifyNode?
				}
				}

				// Try match found pairs
				for (unsigned i = 0; i < Pairs.size(); i += 2) {
				auto *I = Pairs[i];
				auto *J = Pairs[i + 1];

				if ((I->getOpcode() == Instruction::FMul &&
				J->getOpcode() == Instruction::FMul)) {

				// At this point, all operands should be instructions
				if (!isa<Instruction>(I->getOperand(0)) ||
				!isa<Instruction>(I->getOperand(1)))
				continue;
				dmgreen: I feel that this should be correct by construction. What cases are not correct?
				NickGuy: The one that is easiest to see is when attempting to multiply by a value rotated by 270 (`a[i] * (b[i] * I*I*I)`). It produces the following IR, which has the real and imaginary components reversed at %interleaved.vec:
				    %a.real = shufflevector <8 x float> %a, <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				    %a.imag = shufflevector <8 x float> %a, <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				    %b.real = shufflevector <8 x float> %b, <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				    %b.imag = shufflevector <8 x float> %b, <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				    %0 = fmul fast <4 x float> %b.imag, %a.imag
				    %1 = fmul fast <4 x float> %b.real, %a.real
				    %2 = fsub fast <4 x float> %0, %1
				    %3 = fmul fast <4 x float> %b.imag, %a.real
				    %4 = fmul fast <4 x float> %b.real, %a.imag
				    %5 = fadd fast <4 x float> %4, %3
				    %interleaved.vec = shufflevector <4 x float> %5, <4 x float> %2, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
				dmgreen: Sorry - I'm not sure what this was referring to now! Do you think that example should be transforming into complex intrinsics? It doesn't look valid to me. Have you added it as a test case?
				if (!isa<Instruction>(J->getOperand(0)) ||
				!isa<Instruction>(J->getOperand(1)))
				continue;

				auto *VTy = dyn_cast<FixedVectorType>(I->getType());
				if (!VTy)
				dmgreen: Can we move this into identifyNodes (or maybe at this point there will always be nodes?) That would make replaceNodes always return true, which simplifies the getDeadRoots a little too.
				continue;

				auto *NewVTy =
				FixedVectorType::get(VTy->getScalarType(), VTy->getNumElements() * 2);

				if (!TL->isComplexDeinterleavingOperationSupported(
				ComplexDeinterleavingOperation::CMulPartial, NewVTy))
				continue;

				// Partial mul
				auto CN = prepareCompositeNode(
				dmgreen: It's best not to use static data like this, we can make it more generic. The match can be `m_Shuffle(..., m_Mask(Mask))`, then check that the Mask is an isDeinterleavingMask. It does need to check _which_ deinterleaving mask it is though for the Real/Imaginary parts.
				llvm::ComplexDeinterleavingOperation::CMulPartial);
				CN->addInstruction(I);
				CN->addInstruction(J);
				CN->OriginalInput0 = I;
				CN->OriginalInput1 = J;

				unsigned SharedIdx;
				auto *SharedOp =
				dyn_cast_or_null<Instruction>(getSharedOperand(I, J, SharedIdx));
				if (SharedOp) {
				auto Opc = SharedOp->getOpcode();
				if (Opc == Instruction::FNeg) {
				if (SharedIdx == 0)
				dmgreen: Try not to overuse auto where the type isn't already obvious.
				CN->OriginalInput0 = SharedOp;
				else if (SharedIdx == 1)
				CN->OriginalInput1 = SharedOp;
				else {
				LLVM_DEBUG(dbgs() << "Unknown input pattern, somehow the shared "
				"operand index is greater than 1.\n");
				return false;
				}
				CN->addInstruction(SharedOp);
				CN->UsesNegation = true;
				}
				}

				auto Pattern = m_BinOp(m_Shuffle(m_Value(), m_Value()),
				m_Shuffle(m_Value(), m_Value()));
				dmgreen: This isn't used any more, which is good (I think there might be a better way to get a TTI if it was needed).
				NickGuy: Good catch, removed.
				CN->IsTopLevel = match(CN->OriginalInput0, Pattern) &&
				match(CN->OriginalInput1, Pattern);

				CN->OutputNode = J;
				submitCompositeNode(CN);
				continue;
				}

				if (((I->getOpcode() == Instruction::FSub &&
				J->getOpcode() == Instruction::FAdd) ||
				(I->getOpcode() == Instruction::FAdd &&
				dmgreen: When can this return false?
				NickGuy: It can be false if a shuffle hasn't been traversed properly. In hindsight though, this could be an assertion instead as it indicates a problem with the node construction.
				J->getOpcode() == Instruction::FSub))) {

				auto *VTy = dyn_cast<FixedVectorType>(I->getType());
				if (!VTy)
				continue;
				dmgreen: I don't think this should be needed. The inputs should just be present from the ReplacementNode of the operands.
				NickGuy: It's more of a side effect of analysing the whole tree before actually performing the replacement (which we do in case the tree can't be transformed). With some of the recent changes however, this function can't return false so it's merely a step of the replacement now.

				auto *NewVTy =
				FixedVectorType::get(VTy->getScalarType(), VTy->getNumElements() * 2);

				if (!TL->isComplexDeinterleavingOperationSupported(
				ComplexDeinterleavingOperation::CAdd, NewVTy))
				continue;

				LLVM_DEBUG(dbgs() << "Pairing instructions as a CAdd.\n");
				auto CN = prepareCompositeNode(ComplexDeinterleavingOperation::CAdd);
				CN->addInstruction(I);
				CN->addInstruction(J);
				CN->OriginalInput0 = I;
				CN->OriginalInput1 = J;

				auto *Sub = I->getOpcode() == Instruction::FSub ? I : J;
				bool IsLikelyNegated = false;
				if (auto *Shuffle = dyn_cast<ShuffleVectorInst>(Sub->getOperand(0))) {
				dmgreen: If the statistic is being awkward then it is probably not worth keeping (or keep it simple: just count the number of transforms, not the number of individual intrinsics they might become in the backend).
				NickGuy: Changed to track the transforms rather than individual intrinsics.
				auto ShuffleMask = Shuffle->getShuffleMask();
				if (isDeinterleavingMask(ShuffleMask))
				IsLikelyNegated = ShuffleMask[0] == 1;
				}

				if (IsLikelyNegated) {
				LLVM_DEBUG(dbgs() << "Negated adds are not yet supported.\n");
				return false;
				}

				CN->UsesNegation = IsLikelyNegated;
				CN->Rotation = 90;
				if (I->getOpcode() == Instruction::FAdd)
				CN->Rotation = 270;
				CN->OutputNode = J;
				submitCompositeNode(CN);
				continue;
				}
				}

				auto UnmatchedInstructions = findUnmatchedInstructions();
				if (!UnmatchedInstructions.empty()) {
				LLVM_DEBUG(dbgs() << "Unmatched instructions found in graph, cannot "
				"confidently generate complex intrinsics.\n";);
				return false;
				}

				if (CompositeNodes.empty()) {
				LLVM_DEBUG(dbgs() << "No composite nodes found.\n");
				return false;
				}

				sortCompositeNodes(ConvergingI->getParent());

				for (auto *It = CompositeNodes.begin() + 1; It != CompositeNodes.end();
				It++) {
				auto CN = *It;
				auto PrevCN = *(It - 1);
				if (haveSharedUses(CN->OutputNode, PrevCN->OutputNode)) {
				CN->Accumulator = PrevCN->OutputNode;
				PrevCN->Accumulatee = CN->OutputNode;
				}
				}
				dmgreen: If the Operands that CN depends on are included in the Node, then we can just walk up the tree making sure we create the Operands before the Nodes that use them, using the Value's that the operands produce as the Input0/Input1/Accumulator below. That avoids the need to "wrangle" any inputs at this stage, as we already know the nodes we need.

				return true;
				}
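
The inline comments above ask for the deinterleaving shuffles to be checked per part: the real shuffle should start at offset 0, the imaginary one at offset 1, and both should read only from the first operand without changing the overall size. A minimal standalone sketch of such a check follows. It does not use LLVM types, and `isDeinterleavingMaskWithOffset` is a hypothetical helper name, not a function from this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the patch. Returns true when Mask selects
// every second lane starting at Offset (0 = real lanes, 1 = imaginary lanes)
// from a source vector of NumSrcElts lanes.
bool isDeinterleavingMaskWithOffset(const std::vector<int> &Mask, int Offset,
                                    std::size_t NumSrcElts) {
  if (Offset != 0 && Offset != 1)
    return false;
  // A deinterleave produces exactly half of the source lanes.
  if (Mask.empty() || Mask.size() * 2 != NumSrcElts)
    return false;
  // Lane I of the result must come from lane 2 * I + Offset of the source.
  // Given the size check above, every index stays inside the first operand.
  for (std::size_t I = 0; I < Mask.size(); ++I)
    if (Mask[I] != static_cast<int>(2 * I) + Offset)
      return false;
  return true;
}
```

With the masks used in the tests of this revision, `{0, 2, 4, 6}` would be accepted as the real part and `{1, 3, 5, 7}` as the imaginary part of an 8-lane source.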

				bool ComplexDeinterleavingGraph::replaceNodes(const TargetLowering *TL) {
				if (CompositeNodes.empty())
				return false;

				unsigned GeneratedIntrinsics = 0;
				auto *ConvergingI = Instructions[0];

				auto TTI = TL->getTargetMachine().getTargetTransformInfo(
				*ConvergingI->getFunction());
				for (const auto &CN : CompositeNodes) {
				auto *N = cast<Instruction>(CN->OutputNode);
				dmgreen: It might be simpler to remove anything Cost based from this revision, adding it back in if it is needed in followup patches. All the identification that happens in this patch should always be cheaper, as far as I understand.

				// Wrangle the inputs

				/// If the given value is part of a CompositeNode, and said node is part of
				/// an accumulator chain, return the accumulator. Otherwise, returns the
				/// "best fit" value (the ReplacementNode of a containing CompositeNode, or
				/// the value itself)
				auto FollowAccumulatorIfNecessary = [&](Value *V) -> Value * {
				auto *I = dyn_cast<Instruction>(V);
				if (!I)
				return V;

				auto CN = getContainingComposite(I);
				if (!CN)
				return I;

				if (CN->Accumulatee)
				CN = getContainingComposite(cast<Instruction>(CN->Accumulatee));

				return CN->ReplacementNode;
				};

				/// Given a value and an operand index, get said operand and return it.
				/// If the discovered operand is part of a composite node, return the
				/// replacement instead.
				auto GetInputFromOriginalInput = [&](Value *OriginalInput,
				unsigned OpIdx) -> Value * {
				auto *OriginalI = cast<Instruction>(OriginalInput);
				if (OriginalI->getOpcode() == Instruction::FNeg)
				OpIdx = 0;

				auto *Op = OriginalI->getOperand(OpIdx);
				if (auto *SVI = dyn_cast<ShuffleVectorInst>(Op))
				Op = SVI->getOperand(0);

				if (!Op)
				return nullptr;

				if (auto *I = dyn_cast<Instruction>(Op)) {
				if (auto Containing = getContainingComposite(I)) {
				if (Containing->ReplacementNode)
				return Containing->ReplacementNode;
				}
				}
				return Op;
				};

				if (CN->Operation == llvm::ComplexDeinterleavingOperation::CAdd) {
				Value *Sub = nullptr;
				if (auto *Op0 = dyn_cast<Instruction>(CN->OriginalInput0)) {
				if (Op0->getOpcode() == Instruction::FSub)
				Sub = Op0;
				}
				if (!Sub) {
				if (auto *Op1 = dyn_cast<Instruction>(CN->OriginalInput1)) {
				if (Op1->getOpcode() == Instruction::FSub)
				Sub = Op1;
				}
				}

				if (!Sub)
				return false;

				CN->Input0 =
				FollowAccumulatorIfNecessary(GetInputFromOriginalInput(Sub, 0));
				CN->Input1 =
				FollowAccumulatorIfNecessary(GetInputFromOriginalInput(Sub, 1));
				} else {
				CN->Input0 = FollowAccumulatorIfNecessary(
				GetInputFromOriginalInput(CN->OriginalInput0, 0));
				CN->Input1 = FollowAccumulatorIfNecessary(
				GetInputFromOriginalInput(CN->OriginalInput1, 0));

				if (CN->OriginalInput0 != CN->OriginalInput1 && CN->Input0 == CN->Input1)
				CN->Input1 = FollowAccumulatorIfNecessary(
				GetInputFromOriginalInput(CN->OriginalInput1, 1));
				}

				if (CN->Input0 == nullptr || CN->Input1 == nullptr) {
				LLVM_DEBUG(dbgs() << "Couldn't find inputs. Skipping...\n");
				continue;
				}

				if (CN->Accumulator) {
				if (auto Node =
				getContainingComposite(cast<Instruction>(CN->Accumulator)))
				CN->Accumulator = cast<Instruction>(Node->ReplacementNode);
				}

				if (CN->Operation == llvm::ComplexDeinterleavingOperation::CMulPartial &&
				CN->Accumulator) {
				if (auto Node =
				getContainingComposite(cast<Instruction>(CN->Accumulator))) {
				bool Valid90 = (Node->Rotation == 0 && CN->Rotation == 90) ||
				(Node->Rotation == 90 && CN->Rotation == 0);
				bool Valid270 = (Node->Rotation == 180 && CN->Rotation == 270) ||
				(Node->Rotation == 270 && CN->Rotation == 180);
				if (!Valid90 && !Valid270) {
				LLVM_DEBUG(dbgs() << "Invalid rotation pairs.\n");
				return false;
				}

				CN->Input0 = Node->Input0;
				CN->Input1 = Node->Input1;
				}
				}
				if (CN->Operation == llvm::ComplexDeinterleavingOperation::CAdd) {
				Instruction *FAdd = cast<Instruction>(CN->OriginalInput0);
				if (FAdd->getOpcode() != Instruction::FAdd)
				FAdd = cast<Instruction>(CN->OriginalInput1);
				auto RightComponent = getOperatingComponentOfValue(FAdd->getOperand(1));

				if (RightComponent != OperatingComponent::Real) {
				LLVM_DEBUG(dbgs() << "CAdd.FAdd[1] should be the real component.\n");
				return false;
				}
				}

				CN->ReplacementNode = TL->createComplexDeinterleavingIR(
				N, CN->Operation, CN->Rotation, CN->Input0, CN->Input1,
				CN->Accumulator);
				if (!CN->ReplacementNode) {
				LLVM_DEBUG(dbgs() << "Target failed to create Intrinsic call.\n");
				dmgreen: It is better to structure things in a way where we decide whether to do something, then do it. Not get half way through doing it and decide we didn't want to in the end. In what ways would we expect createComplexDeinterleavingIR to return nullptr at the moment?
				return false;
				}

				cast<Instruction>(CN->ReplacementNode)
				->moveAfter(cast<Instruction>(CN->OutputNode));

				CostOfIntrinsics += TTI.getInstructionCost(
				cast<Instruction>(CN->ReplacementNode), CostKind);
				GeneratedIntrinsics += 1;
				}

				auto *R = getFinalInputReplacement(ConvergingI);
				if (!R) {
				LLVM_DEBUG(dbgs() << "Unable to find Final Input Replacement.\n");
				return false;
				}

				InstructionCost CostOfNodes;
				for (const auto &I : Instructions)
				CostOfNodes += TTI.getInstructionCost(I, CostKind);

				LLVM_DEBUG(dbgs() << "Evaluating cost of each graph. Instructions: "
				<< CostOfNodes << ", Intrinsics: " << CostOfIntrinsics
				<< ".\n");
				if (CostOfIntrinsics > CostOfNodes) {
				dmgreen: Is this cost necessary at the moment, or will it always be profitable for the simple cases?
				LLVM_DEBUG(dbgs() << "Not replacing, cost was too high.\n");
				return false;
				}

				cast<Instruction>(R)->getParent()->dump();
				ConvergingI->replaceAllUsesWith(R);

				NumComplexIntrinsics += GeneratedIntrinsics;

				return true;
				}
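
The `Valid90`/`Valid270` check above only accepts a partial multiply whose accumulator has the complementary rotation (0 with 90, or 180 with 270). The sketch below illustrates why those pairs matter, assuming the usual per-lane semantics of a partial complex multiply-accumulate: even rotations consume the real part of the first operand and odd rotations consume its imaginary part. It is a standalone illustration using `std::complex`, and `partialMulAcc` is a hypothetical name, not code from the patch.

```cpp
#include <cassert>
#include <complex>

using Complex = std::complex<float>;

// Assumed semantics of one partial complex multiply-accumulate on a single
// (real, imag) lane pair. Rotation must be 0, 90, 180 or 270.
Complex partialMulAcc(Complex Acc, Complex A, Complex B, unsigned Rotation) {
  float Re = Acc.real();
  float Im = Acc.imag();
  switch (Rotation) {
  case 0: // uses the real part of A
    Re += A.real() * B.real();
    Im += A.real() * B.imag();
    break;
  case 90: // uses the imaginary part of A
    Re -= A.imag() * B.imag();
    Im += A.imag() * B.real();
    break;
  case 180: // negated form of rotation 0
    Re -= A.real() * B.real();
    Im -= A.real() * B.imag();
    break;
  case 270: // negated form of rotation 90
    Re += A.imag() * B.imag();
    Im -= A.imag() * B.real();
    break;
  }
  return Complex(Re, Im);
}

// Chains two partial multiplies, the second accumulating onto the first.
Complex chainPartials(Complex A, Complex B, unsigned Rot0, unsigned Rot1) {
  return partialMulAcc(partialMulAcc(Complex(0, 0), A, B, Rot0), A, B, Rot1);
}
```

Under these assumptions, chaining rotations 0 and 90 yields the full product `A * B`, and chaining 180 and 270 yields `-(A * B)`. Any other pairing covers only one half of `A` twice or mixes signs, which is consistent with the reviewer's point that both the real and imaginary halves must be found.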

				void ComplexDeinterleavingGraph::getDeadRoots(
				SmallVector<Instruction *> &DeadInstrRoots) {
				for (const auto &CN : CompositeNodes) {
				if (auto *I = dyn_cast_or_null<Instruction>(CN->ReplacementNode))
				DeadInstrRoots.push_back(I);
				}
				}

llvm/lib/Target/ARM/ARMISelLowering.h

		public:

		bool shouldFoldConstantShiftPairToMask(const SDNode *N,
		CombineLevel Level) const override;

		bool preferIncOfAddToSubOfNot(EVT VT) const override;

		bool shouldConvertFpToSat(unsigned Op, EVT FPVT, EVT VT) const override;

		bool isComplexDeinterleavingSupported() const override;
		bool isComplexDeinterleavingOperationSupported(
		ComplexDeinterleavingOperation Operation, Type *Ty) const override;

		Value *createComplexDeinterleavingIR(
		Instruction *I, ComplexDeinterleavingOperation OperationType,
		unsigned Rotation, Value *InputA, Value *InputB,
		Value *Accumulator = nullptr) const override;

		protected:
		std::pair<const TargetRegisterClass *, uint8_t>
		findRepresentativeClass(const TargetRegisterInfo *TRI,
		MVT VT) const override;

		private:
		/// Subtarget - Keep a pointer to the ARMSubtarget around so that we can
		/// make the right decision when generating code for different targets.

llvm/lib/Target/ARM/ARMISelLowering.cpp


		for (auto *Exit : Exits)
		.addReg(NewVR);
		}
		}

		void ARMTargetLowering::finalizeLowering(MachineFunction &MF) const {
		MF.getFrameInfo().computeMaxCallFrameSize(MF);
		TargetLoweringBase::finalizeLowering(MF);
		}

		bool ARMTargetLowering::isComplexDeinterleavingSupported() const {
		return Subtarget->hasMVEFloatOps();
		}

		bool ARMTargetLowering::isComplexDeinterleavingOperationSupported(
		ComplexDeinterleavingOperation Operation, Type *Ty) const {
		auto *VTy = dyn_cast<FixedVectorType>(Ty);
		if (!VTy)
		return false;

		if (VTy->getNumElements() * VTy->getScalarSizeInBits() != 128)
		return false;
		dmgreen: Things like this can be a single if:
		    if (!VTy || VTy->getNumElements() * VTy->getScalarSizeInBits() != 128)
		      return false;
		I presume the == 128 can be removed if we teach it how to split the vectors up?

		// Both VCADD and VCMUL/VCMLA support the same types, F16 and F32
		auto *ScalarTy = VTy->getScalarType();
		if (ScalarTy->isHalfTy() || ScalarTy->isFloatTy())
		return true;

		return false;
		}
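
The support check above reduces to two conditions: the interleaved vector must fill exactly one 128-bit register, and the element type must be half or float. The standalone sketch below restates that rule without LLVM types so the accepted shapes are easy to see. `ScalarKind` and `isOperationSupported` are hypothetical names for illustration only.

```cpp
#include <cassert>

// Hypothetical stand-in for the scalar element type of a fixed vector.
enum class ScalarKind { Half, Float, Double };

unsigned scalarSizeInBits(ScalarKind Kind) {
  switch (Kind) {
  case ScalarKind::Half:
    return 16;
  case ScalarKind::Float:
    return 32;
  case ScalarKind::Double:
    return 64;
  }
  return 0;
}

// Mirrors the rule in the hunk above: exactly 128 bits in total, and only
// half or float elements.
bool isOperationSupported(unsigned NumElements, ScalarKind Kind) {
  if (NumElements * scalarSizeInBits(Kind) != 128)
    return false;
  return Kind == ScalarKind::Half || Kind == ScalarKind::Float;
}
```

Under this rule `<8 x half>` and `<4 x float>` are accepted, while `<4 x half>`, `<16 x half>` and `<2 x double>` are rejected. That matches the f16 add test in this revision, where only the v8f16 case produces a `vcadd.f16`.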

		Value *ARMTargetLowering::createComplexDeinterleavingIR(
		Instruction *I, ComplexDeinterleavingOperation OperationType,
		unsigned Rotation, Value *InputA, Value *InputB, Value *Accumulator) const {
		dmgreen: Formatting.

		IRBuilder<> B(I);
		auto *IntTy = Type::getInt32Ty(B.getContext());
		auto *Ty = InputA->getType();

		if (OperationType == ComplexDeinterleavingOperation::CMulPartial) {

		ConstantInt *ConstMulRot = nullptr;
		dmgreen: Add a message to the assert.

		if (Rotation == 0)
		ConstMulRot = ConstantInt::get(IntTy, 0);
		else if (Rotation == 90)
		ConstMulRot = ConstantInt::get(IntTy, 1);
		else if (Rotation == 180)
		ConstMulRot = ConstantInt::get(IntTy, 2);
		else if (Rotation == 270)
		ConstMulRot = ConstantInt::get(IntTy, 3);

		if (!ConstMulRot)
		dmgreen: Can this be an assert instead?
		return nullptr;

		if (Accumulator)
		dmgreen: It is better not to use fixed sized arrays like this, just construct the array to be any needed size. If the values are continuous you can use iota.
		return B.CreateIntrinsic(Intrinsic::arm_mve_vcmlaq, Ty,
		{ConstMulRot, Accumulator, InputB, InputA});
		return B.CreateIntrinsic(Intrinsic::arm_mve_vcmulq, Ty,
		{ConstMulRot, InputB, InputA});
		}

		if (OperationType == ComplexDeinterleavingOperation::CAdd) {

		// 1 means the value is not halved.
		unsigned HalvingVal = 1;
		auto *Halving = ConstantInt::get(IntTy, HalvingVal);

		unsigned RotKey;
		if (Rotation == 90)
		RotKey = 0;
		else if (Rotation == 270)
		RotKey = 1;
		else
		return nullptr; // Invalid rotation for arm_mve_vcaddq

		auto *RotVal = ConstantInt::get(IntTy, RotKey);
		return B.CreateIntrinsic(Intrinsic::arm_mve_vcaddq, Ty,
		{Halving, RotVal, InputA, InputB});
		}

		return nullptr;
		}
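
The function above maps a rotation in degrees to the immediate passed to each intrinsic: the partial multiply takes `Rotation / 90` (0 to 3), while the complex add accepts only 90 and 270 and encodes them as 0 and 1. A minimal standalone sketch of that mapping follows. `Operation` and `rotationImmediate` are hypothetical names, and returning `std::nullopt` stands in for the `nullptr` returns in the patch.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for ComplexDeinterleavingOperation.
enum class Operation { CMulPartial, CAdd };

// Returns the rotation immediate for the given operation, or std::nullopt
// when the rotation is not one the operation accepts.
std::optional<unsigned> rotationImmediate(Operation Op, unsigned Rotation) {
  if (Rotation % 90 != 0 || Rotation > 270)
    return std::nullopt;
  if (Op == Operation::CMulPartial)
    return Rotation / 90; // 0, 90, 180, 270 -> 0, 1, 2, 3
  // Complex add only supports rotations of 90 and 270 degrees.
  if (Rotation == 90)
    return 0u;
  if (Rotation == 270)
    return 1u;
  return std::nullopt;
}
```

Keeping the validity decision in one place like this is one way to follow the review suggestion of deciding whether a transform is possible before starting to emit IR.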

llvm/lib/Target/ARM/ARMTargetMachine.cpp

		addPass(createCFGSimplificationPass(
		}));

		addPass(createMVEGatherScatterLoweringPass());
		addPass(createMVELaneInterleavingPass());

		TargetPassConfig::addIRPasses();

		// Run the parallel DSP pass.
		if (getOptLevel() == CodeGenOpt::Aggressive)
		addPass(createARMParallelDSPPass());

		// Match complex arithmetic patterns
		if (TM->getOptLevel() >= CodeGenOpt::Default)
		addPass(createComplexDeinterleavingPass(TM));

		// Match interleaved memory accesses to ldN/stN intrinsics.
		if (TM->getOptLevel() != CodeGenOpt::None) {
		dmgreen: This doesn't need to add the new brackets.
		dmgreen: This again.
		addPass(createInterleavedAccessPass());
		}

		// Add Control Flow Guard checks.
		if (TM->getTargetTriple().isOSWindows())
		addPass(createCFGuardCheckPass());

		if (TM->Options.JMCInstrument)
		addPass(createJMCInstrumenterPass());
		}

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-add.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s --mattr=+mve.fp -o - | FileCheck %s

				dmgreen: I would use run lines that are similar to the llvm/test/CodeGen/Thumb2/mve-xyz.ll tests. Something like:
				    ; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs %s -o - | FileCheck %s
				It's best not to use a specific cpu, using the arch instead. The tests can probably go in the same place too, under llvm/test/CodeGen/Thumb2/mve-xyz.ll for mve tests.
				dmgreen: These tests seem to have moved back to the wrong place.
				target triple = "thumbv8.1m.main-none-none-eabi"


				dmgreen: This probably isn't worth testing if it is testing codegen already.

				define <2 x half> @complex_add_v2f16(<2 x half> %a, <2 x half> %b) #0 {
				; CHECK-LABEL: complex_add_v2f16:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: mov r2, sp
				; CHECK-NEXT: vmov d0, r0, r1
				; CHECK-NEXT: vldrw.u32 q1, [r2]
				; CHECK-NEXT: vmovx.f16 s2, s4
				; CHECK-NEXT: vadd.f16 s2, s2, s0
				; CHECK-NEXT: vmovx.f16 s0, s0
				; CHECK-NEXT: vsub.f16 s0, s4, s0
				; CHECK-NEXT: vins.f16 s0, s2
				; CHECK-NEXT: vmov r0, r1, d0
				; CHECK-NEXT: vmov r2, r3, d0
				; CHECK-NEXT: bx lr
				entry:
				%a.real = shufflevector <2 x half> %a, <2 x half> zeroinitializer, <1 x i32> <i32 0>
				%a.imag = shufflevector <2 x half> %a, <2 x half> zeroinitializer, <1 x i32> <i32 1>
				%b.real = shufflevector <2 x half> %b, <2 x half> zeroinitializer, <1 x i32> <i32 0>
				%b.imag = shufflevector <2 x half> %b, <2 x half> zeroinitializer, <1 x i32> <i32 1>
				%0 = fsub fast <1 x half> %b.real, %a.imag
				%1 = fadd fast <1 x half> %b.imag, %a.real
				%interleaved.vec = shufflevector <1 x half> %0, <1 x half> %1, <2 x i32> <i32 0, i32 1>
				ret <2 x half> %interleaved.vec
				}
				define <4 x half> @complex_add_v4f16(<4 x half> %a, <4 x half> %b) #0 {
				; CHECK-LABEL: complex_add_v4f16:
				; CHECK: @ %bb.0: @ %entry
				dmgreen: A lot of these tests are strange - they seem to have infinite loops? I think you should be able to remove most of it to make a much cleaner test. It doesn't even need the loads/stores:
				    define <8 x half> @complex_add_v8f16(<8 x half> %a, <8 x half> %b) {
				    entry:
				      %strided.vec = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				      %strided.vec21 = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				      %strided.vec23 = shufflevector <8 x half> %b, <8 x half> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				      %strided.vec24 = shufflevector <8 x half> %b, <8 x half> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				      %0 = fsub fast <4 x half> %strided.vec23, %strided.vec21
				      %1 = fadd fast <4 x half> %strided.vec24, %strided.vec
				      %interleaved.vec = shufflevector <4 x half> %0, <4 x half> %1, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
				      ret <8 x half> %interleaved.vec
				    }
				NickGuy: These were initially generated from a larger IR, and pushed through llvm-reduce. I've rewritten them manually to be much smaller (basing them off of your snippet).
				; CHECK-NEXT: vmov d0, r0, r1
				; CHECK-NEXT: mov r0, sp
				; CHECK-NEXT: vldrw.u32 q2, [r0]
				; CHECK-NEXT: vmovx.f16 s4, s0
				; CHECK-NEXT: vmovx.f16 s2, s1
				; CHECK-NEXT: vins.f16 s0, s1
				; CHECK-NEXT: vins.f16 s4, s2
				; CHECK-NEXT: vmovx.f16 s12, s8
				; CHECK-NEXT: vmovx.f16 s2, s9
				; CHECK-NEXT: vins.f16 s8, s9
				; CHECK-NEXT: vins.f16 s12, s2
				; CHECK-NEXT: vsub.f16 q1, q2, q1
				; CHECK-NEXT: vadd.f16 q0, q3, q0
				; CHECK-NEXT: vmovx.f16 s5, s4
				; CHECK-NEXT: vins.f16 s4, s0
				; CHECK-NEXT: vmovx.f16 s0, s0
				; CHECK-NEXT: vins.f16 s5, s0
				; CHECK-NEXT: vmov r2, r3, d0
				; CHECK-NEXT: vmov r0, r1, d2
				; CHECK-NEXT: bx lr
				entry:
				%a.real = shufflevector <4 x half> %a, <4 x half> zeroinitializer, <2 x i32> <i32 0, i32 2>
				%a.imag = shufflevector <4 x half> %a, <4 x half> zeroinitializer, <2 x i32> <i32 1, i32 3>
				%b.real = shufflevector <4 x half> %b, <4 x half> zeroinitializer, <2 x i32> <i32 0, i32 2>
				%b.imag = shufflevector <4 x half> %b, <4 x half> zeroinitializer, <2 x i32> <i32 1, i32 3>
				%0 = fsub fast <2 x half> %b.real, %a.imag
				%1 = fadd fast <2 x half> %b.imag, %a.real
				%interleaved.vec = shufflevector <2 x half> %0, <2 x half> %1, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
				ret <4 x half> %interleaved.vec
				}
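
Each test function ends with a `shufflevector` that re-interleaves the real and imaginary result vectors, using masks such as `<0, 2, 1, 3>` and `<0, 4, 1, 5, 2, 6, 3, 7>`. In line with the review suggestion to build masks for any needed size rather than hard-coding them, the sketch below generates that interleaving mask for an arbitrary number of complex values. It is a standalone illustration, and `makeInterleaveMask` is a hypothetical name, not a function from the patch.

```cpp
#include <cassert>
#include <vector>

// Builds the mask that interleaves two NumPairs-wide vectors (real parts in
// the first operand, imaginary parts in the second) into one vector of
// (real, imag) pairs.
std::vector<int> makeInterleaveMask(unsigned NumPairs) {
  std::vector<int> Mask;
  Mask.reserve(2 * NumPairs);
  for (unsigned I = 0; I < NumPairs; ++I) {
    Mask.push_back(static_cast<int>(I));            // lane I of operand 0
    Mask.push_back(static_cast<int>(NumPairs + I)); // lane I of operand 1
  }
  return Mask;
}
```

For two complex values this produces `{0, 2, 1, 3}`, the mask used by `complex_add_v4f16` above.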
				define <8 x half> @complex_add_v8f16(<8 x half> %a, <8 x half> %b) #0 {
				; CHECK-LABEL: complex_add_v8f16:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vmov d0, r0, r1
				; CHECK-NEXT: mov r0, sp
				; CHECK-NEXT: vldrw.u32 q1, [r0]
				; CHECK-NEXT: vmov d1, r2, r3
				; CHECK-NEXT: vcadd.f16 q0, q1, q0, #90
				; CHECK-NEXT: vmov r0, r1, d0
				; CHECK-NEXT: vmov r2, r3, d1
				; CHECK-NEXT: bx lr
				entry:
				%a.real = shufflevector <8 x half> %a, <8 x half> zeroinitializer, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				%a.imag = shufflevector <8 x half> %a, <8 x half> zeroinitializer, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				%b.real = shufflevector <8 x half> %b, <8 x half> zeroinitializer, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				%b.imag = shufflevector <8 x half> %b, <8 x half> zeroinitializer, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				%0 = fsub fast <4 x half> %b.real, %a.imag
				%1 = fadd fast <4 x half> %b.imag, %a.real
				%interleaved.vec = shufflevector <4 x half> %0, <4 x half> %1, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
				ret <8 x half> %interleaved.vec
				}
				define <16 x half> @complex_add_v16f16(<16 x half> %a, <16 x half> %b) #0 {
				; CHECK-LABEL: complex_add_v16f16:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .vsave {d8, d9, d10, d11}
				; CHECK-NEXT: vpush {d8, d9, d10, d11}
				; CHECK-NEXT: vmov d2, r2, r3
				; CHECK-NEXT: vldr d3, [sp, #32]
				; CHECK-NEXT: add r1, sp, #40
				; CHECK-NEXT: vmovx.f16 s0, s4
				; CHECK-NEXT: vldrw.u32 q2, [r1]
				; CHECK-NEXT: vmovx.f16 s2, s5
				; CHECK-NEXT: vins.f16 s0, s2
				; CHECK-NEXT: vmovx.f16 s1, s6
				; CHECK-NEXT: vmovx.f16 s2, s7
				; CHECK-NEXT: vmovx.f16 s12, s9
				; CHECK-NEXT: vins.f16 s1, s2
				; CHECK-NEXT: vmovx.f16 s2, s8
				; CHECK-NEXT: vins.f16 s2, s12
				; CHECK-NEXT: vmovx.f16 s3, s10
				; CHECK-NEXT: vmovx.f16 s12, s11
				; CHECK-NEXT: add r1, sp, #56
				; CHECK-NEXT: vins.f16 s3, s12
				; CHECK-NEXT: vldrw.u32 q3, [r1]
				; CHECK-NEXT: add r1, sp, #72
				; CHECK-NEXT: vins.f16 s4, s5
				; CHECK-NEXT: vldrw.u32 q5, [r1]
				; CHECK-NEXT: vmovx.f16 s16, s12
				; CHECK-NEXT: vmovx.f16 s5, s13
				; CHECK-NEXT: vmovx.f16 s17, s14
				; CHECK-NEXT: vins.f16 s16, s5
				; CHECK-NEXT: vmovx.f16 s5, s15
				; CHECK-NEXT: vins.f16 s17, s5
				; CHECK-NEXT: vmovx.f16 s18, s20
				; CHECK-NEXT: vmovx.f16 s5, s21
				; CHECK-NEXT: vmovx.f16 s19, s22
				; CHECK-NEXT: vins.f16 s18, s5
				; CHECK-NEXT: vmovx.f16 s5, s23
				; CHECK-NEXT: vins.f16 s6, s7
				; CHECK-NEXT: vins.f16 s10, s11
				; CHECK-NEXT: vins.f16 s14, s15
				; CHECK-NEXT: vins.f16 s22, s23
				; CHECK-NEXT: vins.f16 s20, s21
				; CHECK-NEXT: vins.f16 s8, s9
				; CHECK-NEXT: vins.f16 s19, s5
				; CHECK-NEXT: vins.f16 s12, s13
				; CHECK-NEXT: vmov.f32 s13, s14
				; CHECK-NEXT: vmov.f32 s5, s6
				; CHECK-NEXT: vmov.f32 s14, s20
				; CHECK-NEXT: vmov.f32 s15, s22
				; CHECK-NEXT: vmov.f32 s6, s8
				; CHECK-NEXT: vsub.f16 q0, q3, q0
				; CHECK-NEXT: vmov.f32 s7, s10
				; CHECK-NEXT: vmovx.f16 s12, s0
				; CHECK-NEXT: vadd.f16 q1, q4, q1
				; CHECK-NEXT: vmovx.f16 s9, s2
				; CHECK-NEXT: vins.f16 s0, s4
				; CHECK-NEXT: vmovx.f16 s4, s4
				; CHECK-NEXT: vins.f16 s2, s6
				; CHECK-NEXT: vmovx.f16 s6, s6
				; CHECK-NEXT: vins.f16 s12, s4
				; CHECK-NEXT: vmovx.f16 s4, s1
				; CHECK-NEXT: vmovx.f16 s8, s5
				; CHECK-NEXT: vins.f16 s9, s6
				; CHECK-NEXT: vmovx.f16 s11, s3
				; CHECK-NEXT: vins.f16 s3, s7
				; CHECK-NEXT: vmovx.f16 s6, s7
				; CHECK-NEXT: vins.f16 s4, s8
				; CHECK-NEXT: vins.f16 s1, s5
				; CHECK-NEXT: vins.f16 s11, s6
				; CHECK-NEXT: vmov.f32 s8, s2
				; CHECK-NEXT: vmov.f32 s10, s3
				; CHECK-NEXT: vstrw.32 q2, [r0, #16]
				; CHECK-NEXT: vmov q2, q0
				; CHECK-NEXT: vmov.f32 s9, s12
				; CHECK-NEXT: vmov.f32 s10, s1
				; CHECK-NEXT: vmov.f32 s11, s4
				; CHECK-NEXT: vstrw.32 q2, [r0]
				; CHECK-NEXT: vpop {d8, d9, d10, d11}
				; CHECK-NEXT: bx lr
				entry:
				%a.real = shufflevector <16 x half> %a, <16 x half> zeroinitializer, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%a.imag = shufflevector <16 x half> %a, <16 x half> zeroinitializer, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%b.real = shufflevector <16 x half> %b, <16 x half> zeroinitializer, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%b.imag = shufflevector <16 x half> %b, <16 x half> zeroinitializer, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%0 = fsub fast <8 x half> %b.real, %a.imag
				%1 = fadd fast <8 x half> %b.imag, %a.real
				%interleaved.vec = shufflevector <8 x half> %0, <8 x half> %1, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
				ret <16 x half> %interleaved.vec
				}
				define <32 x half> @complex_add_v32f16(<32 x half> %a, <32 x half> %b) #0 {
				; CHECK-LABEL: complex_add_v32f16:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .vsave {d8, d9, d10, d11, d12, d13, d14, d15}
				; CHECK-NEXT: vpush {d8, d9, d10, d11, d12, d13, d14, d15}
				; CHECK-NEXT: .pad #8
				; CHECK-NEXT: sub sp, #8
				; CHECK-NEXT: vmov d2, r2, r3
				; CHECK-NEXT: vldr d3, [sp, #72]
				; CHECK-NEXT: add r1, sp, #128
				; CHECK-NEXT: vmovx.f16 s0, s4
				; CHECK-NEXT: vldrw.u32 q5, [r1]
				; CHECK-NEXT: vmovx.f16 s2, s5
				; CHECK-NEXT: vmovx.f16 s1, s6
				; CHECK-NEXT: vins.f16 s0, s2
				; CHECK-NEXT: vmovx.f16 s2, s7
				; CHECK-NEXT: add r1, sp, #144
				; CHECK-NEXT: vldrw.u32 q4, [r1]
				; CHECK-NEXT: vins.f16 s1, s2
				; CHECK-NEXT: vmovx.f16 s2, s21
				; CHECK-NEXT: vmovx.f16 s12, s20
				; CHECK-NEXT: vmov.f32 s8, s20
				; CHECK-NEXT: vins.f16 s12, s2
				; CHECK-NEXT: vmov.f32 s9, s22
				; CHECK-NEXT: vmovx.f16 s2, s23
				; CHECK-NEXT: vmovx.f16 s13, s22
				; CHECK-NEXT: add r1, sp, #80
				; CHECK-NEXT: vmovx.f16 s14, s16
				; CHECK-NEXT: vins.f16 s13, s2
				; CHECK-NEXT: vmovx.f16 s2, s17
				; CHECK-NEXT: vins.f16 s8, s21
				; CHECK-NEXT: vins.f16 s9, s23
				; CHECK-NEXT: vldrw.u32 q5, [r1]
				; CHECK-NEXT: vmovx.f16 s15, s18
				; CHECK-NEXT: vins.f16 s14, s2
				; CHECK-NEXT: vmovx.f16 s2, s19
				; CHECK-NEXT: vmovx.f16 s10, s21
				; CHECK-NEXT: vins.f16 s15, s2
				; CHECK-NEXT: vmovx.f16 s2, s20
				; CHECK-NEXT: vins.f16 s18, s19
				; CHECK-NEXT: vmovx.f16 s3, s22
				; CHECK-NEXT: vins.f16 s2, s10
				; CHECK-NEXT: vmovx.f16 s10, s23
				; CHECK-NEXT: vins.f16 s6, s7
				; CHECK-NEXT: vins.f16 s22, s23
				; CHECK-NEXT: vins.f16 s16, s17
				; CHECK-NEXT: vins.f16 s20, s21
				; CHECK-NEXT: vins.f16 s4, s5
				; CHECK-NEXT: vins.f16 s3, s10
				; CHECK-NEXT: vmov.f32 s5, s6
				; CHECK-NEXT: add r1, sp, #96
				; CHECK-NEXT: vmov.f32 s10, s16
				; CHECK-NEXT: vmov.f32 s11, s18
				; CHECK-NEXT: vldrw.u32 q4, [r1]
				; CHECK-NEXT: vmov.f32 s6, s20
				; CHECK-NEXT: vsub.f16 q0, q2, q0
				; CHECK-NEXT: vmov.f32 s7, s22
				; CHECK-NEXT: vmovx.f16 s8, s0
				; CHECK-NEXT: vadd.f16 q1, q3, q1
				; CHECK-NEXT: vmovx.f16 s9, s2
				; CHECK-NEXT: vins.f16 s0, s4
				; CHECK-NEXT: vmovx.f16 s4, s4
				; CHECK-NEXT: vins.f16 s8, s4
				; CHECK-NEXT: vmovx.f16 s4, s5
				; CHECK-NEXT: vstr s8, [sp, #4] @ 4-byte Spill
				; CHECK-NEXT: vmovx.f16 s8, s1
				; CHECK-NEXT: vins.f16 s8, s4
				; CHECK-NEXT: vmovx.f16 s4, s6
				; CHECK-NEXT: vins.f16 s9, s4
				; CHECK-NEXT: vmovx.f16 s11, s3
				; CHECK-NEXT: vmovx.f16 s4, s7
				; CHECK-NEXT: vmovx.f16 s12, s16
				; CHECK-NEXT: vins.f16 s11, s4
				; CHECK-NEXT: vmovx.f16 s4, s17
				; CHECK-NEXT: add r1, sp, #160
				; CHECK-NEXT: vins.f16 s12, s4
				; CHECK-NEXT: vldrw.u32 q7, [r1]
				; CHECK-NEXT: vmovx.f16 s13, s18
				; CHECK-NEXT: vmovx.f16 s4, s19
				; CHECK-NEXT: add r1, sp, #112
				; CHECK-NEXT: vins.f16 s3, s7
				; CHECK-NEXT: vins.f16 s1, s5
				; CHECK-NEXT: vins.f16 s2, s6
				; CHECK-NEXT: vins.f16 s13, s4
				; CHECK-NEXT: vldrw.u32 q1, [r1]
				; CHECK-NEXT: vstr s8, [sp] @ 4-byte Spill
				; CHECK-NEXT: add r1, sp, #176
				; CHECK-NEXT: vmov.f32 s20, s28
				; CHECK-NEXT: vmovx.f16 s14, s4
				; CHECK-NEXT: vmovx.f16 s8, s5
				; CHECK-NEXT: vins.f16 s14, s8
				; CHECK-NEXT: vldrw.u32 q6, [r1]
				; CHECK-NEXT: vmovx.f16 s15, s6
				; CHECK-NEXT: vmovx.f16 s8, s7
				; CHECK-NEXT: vins.f16 s18, s19
				; CHECK-NEXT: vins.f16 s4, s5
				; CHECK-NEXT: vins.f16 s15, s8
				; CHECK-NEXT: vins.f16 s16, s17
				; CHECK-NEXT: vmov.f32 s17, s18
				; CHECK-NEXT: vins.f16 s20, s29
				; CHECK-NEXT: vmovx.f16 s8, s29
				; CHECK-NEXT: vmov.f32 s18, s4
				; CHECK-NEXT: vmovx.f16 s4, s31
				; CHECK-NEXT: vmovx.f16 s29, s30
				; CHECK-NEXT: vmov.f32 s21, s30
				; CHECK-NEXT: vins.f16 s29, s4
				; CHECK-NEXT: vmovx.f16 s4, s25
				; CHECK-NEXT: vmovx.f16 s30, s24
				; CHECK-NEXT: vmov.f32 s22, s24
				; CHECK-NEXT: vmovx.f16 s28, s28
				; CHECK-NEXT: vins.f16 s21, s31
				; CHECK-NEXT: vins.f16 s6, s7
				; CHECK-NEXT: vins.f16 s30, s4
				; CHECK-NEXT: vmovx.f16 s31, s26
				; CHECK-NEXT: vmovx.f16 s4, s27
				; CHECK-NEXT: vins.f16 s26, s27
				; CHECK-NEXT: vins.f16 s22, s25
				; CHECK-NEXT: vmov.f32 s23, s26
				; CHECK-NEXT: vins.f16 s28, s8
				; CHECK-NEXT: vmov.f32 s19, s6
				; CHECK-NEXT: vins.f16 s31, s4
				; CHECK-NEXT: vsub.f16 q5, q5, q3
				; CHECK-NEXT: vadd.f16 q4, q7, q4
				; CHECK-NEXT: vmovx.f16 s13, s22
				; CHECK-NEXT: vmovx.f16 s4, s18
				; CHECK-NEXT: vmovx.f16 s15, s23
				; CHECK-NEXT: vins.f16 s13, s4
				; CHECK-NEXT: vmovx.f16 s4, s19
				; CHECK-NEXT: vins.f16 s15, s4
				; CHECK-NEXT: vmovx.f16 s4, s20
				; CHECK-NEXT: vmovx.f16 s6, s16
				; CHECK-NEXT: vins.f16 s22, s18
				; CHECK-NEXT: vins.f16 s23, s19
				; CHECK-NEXT: vins.f16 s4, s6
				; CHECK-NEXT: vmovx.f16 s6, s21
				; CHECK-NEXT: vmovx.f16 s8, s17
				; CHECK-NEXT: vins.f16 s20, s16
				; CHECK-NEXT: vins.f16 s21, s17
				; CHECK-NEXT: vmov q4, q5
				; CHECK-NEXT: vins.f16 s6, s8
				; CHECK-NEXT: vmov.f32 s12, s22
				; CHECK-NEXT: vmov.f32 s14, s23
				; CHECK-NEXT: vmov.f32 s17, s4
				; CHECK-NEXT: vstrw.32 q3, [r0, #48]
				; CHECK-NEXT: vmov.f32 s19, s6
				; CHECK-NEXT: vmov q1, q0
				; CHECK-NEXT: vmov.f32 s18, s21
				; CHECK-NEXT: vmov.f32 s8, s2
				; CHECK-NEXT: vstrw.32 q4, [r0, #32]
				; CHECK-NEXT: vmov.f32 s10, s3
				; CHECK-NEXT: vstrw.32 q2, [r0, #16]
				; CHECK-NEXT: vmov.f32 s6, s1
				; CHECK-NEXT: vldr s5, [sp, #4] @ 4-byte Reload
				; CHECK-NEXT: vldr s7, [sp] @ 4-byte Reload
				; CHECK-NEXT: vstrw.32 q1, [r0]
				; CHECK-NEXT: add sp, #8
				; CHECK-NEXT: vpop {d8, d9, d10, d11, d12, d13, d14, d15}
				; CHECK-NEXT: bx lr
				entry:
				%a.real = shufflevector <32 x half> %a, <32 x half> zeroinitializer, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30>
				%a.imag = shufflevector <32 x half> %a, <32 x half> zeroinitializer, <16 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15, i32 17, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31>
				%b.real = shufflevector <32 x half> %b, <32 x half> zeroinitializer, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30>
				%b.imag = shufflevector <32 x half> %b, <32 x half> zeroinitializer, <16 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15, i32 17, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31>
				%0 = fsub fast <16 x half> %b.real, %a.imag
				%1 = fadd fast <16 x half> %b.imag, %a.real
				%interleaved.vec = shufflevector <16 x half> %0, <16 x half> %1, <32 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23, i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
				ret <32 x half> %interleaved.vec
				}
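
As a reading aid for the tests above: each `complex_add_*` function deinterleaves `%a` and `%b` into real and imaginary lanes, computes `b.real - a.imag` and `b.imag + a.real`, and re-interleaves the two results. Per complex element that is `b + i*a`, a complex add with `a` rotated by 90 degrees, which is the operation the `vcadd.f16 q0, q1, q0, #90` line in the v8f16 case stands for. The following is a minimal Python sketch of that lane arithmetic only; the helper names (`deinterleave`, `interleave`, `complex_add_rot90`) are made up for illustration and do not appear in the patch.

```python
def deinterleave(v):
    """Split [re0, im0, re1, im1, ...] into (reals, imags).

    Mirrors the <0, 2, 4, 6> and <1, 3, 5, 7> shuffle masks in the tests.
    """
    return v[0::2], v[1::2]


def interleave(reals, imags):
    """Inverse of deinterleave; mirrors the <0, 4, 1, 5, 2, 6, 3, 7> mask."""
    out = []
    for re, im in zip(reals, imags):
        out.extend([re, im])
    return out


def complex_add_rot90(a, b):
    """Scalar model of the IR pattern: (b.real - a.imag, b.imag + a.real)."""
    a_real, a_imag = deinterleave(a)
    b_real, b_imag = deinterleave(b)
    res_real = [br - ai for br, ai in zip(b_real, a_imag)]
    res_imag = [bi + ar for bi, ar in zip(b_imag, a_real)]
    return interleave(res_real, res_imag)


# Four complex elements per operand, as in the <8 x half> test.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
result = complex_add_rot90(a, b)

# Cross-check against Python's complex type: b + 1j * a for each pair.
expected = []
for k in range(0, len(a), 2):
    z = complex(b[k], b[k + 1]) + 1j * complex(a[k], a[k + 1])
    expected.extend([z.real, z.imag])
```

Here `result` and `expected` hold the same interleaved values, which is the equivalence the pass relies on when it replaces the shuffle/`fsub`/`fadd` pattern with a single rotated complex add.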

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-mul.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s --mattr=+mve.fp -o - | FileCheck %s

				target triple = "thumbv8.1m.main-none-none-eabi"

				define <2 x half> @complex_mul_v2f16(<2 x half> %a, <2 x half> %b) #0 {
				; CHECK-LABEL: complex_mul_v2f16:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: mov r2, sp
				; CHECK-NEXT: vmov d2, r0, r1
				; CHECK-NEXT: vldrw.u32 q0, [r2]
				; CHECK-NEXT: vmovx.f16 s8, s4
				; CHECK-NEXT: vmovx.f16 s2, s0
				; CHECK-NEXT: vmov r2, r3, d0
				; CHECK-NEXT: vmul.f16 s6, s2, s4
				; CHECK-NEXT: vfma.f16 s6, s0, s8
				; CHECK-NEXT: vmul.f16 s8, s8, s2
				; CHECK-NEXT: vfnms.f16 s8, s0, s4
				; CHECK-NEXT: vins.f16 s8, s6
				; CHECK-NEXT: vmov r0, r1, d4
				; CHECK-NEXT: bx lr
				entry:
				%a.real = shufflevector <2 x half> %a, <2 x half> poison, <1 x i32> <i32 0>
				%a.imag = shufflevector <2 x half> %a, <2 x half> poison, <1 x i32> <i32 1>
				%b.real = shufflevector <2 x half> %b, <2 x half> poison, <1 x i32> <i32 0>
				%b.imag = shufflevector <2 x half> %b, <2 x half> poison, <1 x i32> <i32 1>
				%0 = fmul fast <1 x half> %b.imag, %a.real
				%1 = fmul fast <1 x half> %b.real, %a.imag
				%2 = fadd fast <1 x half> %1, %0
				%3 = fmul fast <1 x half> %b.real, %a.real
				%4 = fmul fast <1 x half> %a.imag, %b.imag
				%5 = fsub fast <1 x half> %3, %4
				%interleaved.vec = shufflevector <1 x half> %5, <1 x half> %2, <2 x i32> <i32 0, i32 1>
				ret <2 x half> %interleaved.vec
				}

				define <4 x half> @complex_mul_v4f16(<4 x half> %a, <4 x half> %b) #0 {
				; CHECK-LABEL: complex_mul_v4f16:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .vsave {d8, d9}
				; CHECK-NEXT: vpush {d8, d9}
				; CHECK-NEXT: vmov d0, r0, r1
				; CHECK-NEXT: add r0, sp, #16
				; CHECK-NEXT: vldrw.u32 q2, [r0]
				; CHECK-NEXT: vmovx.f16 s4, s0
				; CHECK-NEXT: vmovx.f16 s2, s1
				; CHECK-NEXT: vins.f16 s0, s1
				; CHECK-NEXT: vins.f16 s4, s2
				; CHECK-NEXT: vmovx.f16 s12, s8
				; CHECK-NEXT: vmovx.f16 s2, s9
				; CHECK-NEXT: vins.f16 s8, s9
				; CHECK-NEXT: vins.f16 s12, s2
				; CHECK-NEXT: vmul.f16 q4, q3, q0
				; CHECK-NEXT: vfma.f16 q4, q2, q1
				; CHECK-NEXT: vmul.f16 q1, q1, q3
				; CHECK-NEXT: vneg.f16 q1, q1
				; CHECK-NEXT: vfma.f16 q1, q2, q0
				; CHECK-NEXT: vmovx.f16 s0, s16
				; CHECK-NEXT: vmovx.f16 s5, s4
				; CHECK-NEXT: vins.f16 s4, s16
				; CHECK-NEXT: vins.f16 s5, s0
				; CHECK-NEXT: vmov r2, r3, d0
				; CHECK-NEXT: vmov r0, r1, d2
				; CHECK-NEXT: vpop {d8, d9}
				; CHECK-NEXT: bx lr
				entry:
				%a.real = shufflevector <4 x half> %a, <4 x half> poison, <2 x i32> <i32 0, i32 2>
				%a.imag = shufflevector <4 x half> %a, <4 x half> poison, <2 x i32> <i32 1, i32 3>
				%b.real = shufflevector <4 x half> %b, <4 x half> poison, <2 x i32> <i32 0, i32 2>
				%b.imag = shufflevector <4 x half> %b, <4 x half> poison, <2 x i32> <i32 1, i32 3>
				%0 = fmul fast <2 x half> %b.imag, %a.real
				%1 = fmul fast <2 x half> %b.real, %a.imag
				%2 = fadd fast <2 x half> %1, %0
				%3 = fmul fast <2 x half> %b.real, %a.real
				%4 = fmul fast <2 x half> %a.imag, %b.imag
				%5 = fsub fast <2 x half> %3, %4
				%interleaved.vec = shufflevector <2 x half> %5, <2 x half> %2, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
				ret <4 x half> %interleaved.vec
				}

				define <8 x half> @complex_mul_v8f16(<8 x half> %a, <8 x half> %b) #0 {
				; CHECK-LABEL: complex_mul_v8f16:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vmov d0, r0, r1
				; CHECK-NEXT: mov r0, sp
				; CHECK-NEXT: vldrw.u32 q1, [r0]
				; CHECK-NEXT: vmov d1, r2, r3
				; CHECK-NEXT: vcmul.f16 q2, q0, q1, #90
				; CHECK-NEXT: vcmla.f16 q2, q0, q1, #0
				; CHECK-NEXT: vmov r0, r1, d4
				; CHECK-NEXT: vmov r2, r3, d5
				; CHECK-NEXT: bx lr
				entry:
				%a.real = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				%a.imag = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				%b.real = shufflevector <8 x half> %b, <8 x half> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				%b.imag = shufflevector <8 x half> %b, <8 x half> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				%0 = fmul fast <4 x half> %b.imag, %a.real
				%1 = fmul fast <4 x half> %b.real, %a.imag
				%2 = fadd fast <4 x half> %1, %0
				%3 = fmul fast <4 x half> %b.real, %a.real
				%4 = fmul fast <4 x half> %a.imag, %b.imag
				%5 = fsub fast <4 x half> %3, %4
				%interleaved.vec = shufflevector <4 x half> %5, <4 x half> %2, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
				ret <8 x half> %interleaved.vec
				}

				define <16 x half> @complex_mul_v16f16(<16 x half> %a, <16 x half> %b) #0 {
				; CHECK-LABEL: complex_mul_v16f16:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .vsave {d8, d9, d10, d11}
				; CHECK-NEXT: vpush {d8, d9, d10, d11}
				; CHECK-NEXT: vmov d0, r2, r3
				; CHECK-NEXT: vldr d1, [sp, #32]
				; CHECK-NEXT: add r1, sp, #56
				; CHECK-NEXT: vmovx.f16 s4, s0
				; CHECK-NEXT: vldrw.u32 q5, [r1]
				; CHECK-NEXT: vmovx.f16 s6, s1
				; CHECK-NEXT: vmovx.f16 s5, s2
				; CHECK-NEXT: vins.f16 s4, s6
				; CHECK-NEXT: vmovx.f16 s6, s3
				; CHECK-NEXT: add r1, sp, #72
				; CHECK-NEXT: vldrw.u32 q4, [r1]
				; CHECK-NEXT: vins.f16 s5, s6
				; CHECK-NEXT: vmovx.f16 s6, s21
				; CHECK-NEXT: vmovx.f16 s12, s20
				; CHECK-NEXT: vmov.f32 s8, s20
				; CHECK-NEXT: vins.f16 s12, s6
				; CHECK-NEXT: vmov.f32 s9, s22
				; CHECK-NEXT: vmovx.f16 s6, s23
				; CHECK-NEXT: vmovx.f16 s13, s22
				; CHECK-NEXT: add r1, sp, #40
				; CHECK-NEXT: vmovx.f16 s14, s16
				; CHECK-NEXT: vins.f16 s13, s6
				; CHECK-NEXT: vmovx.f16 s6, s17
				; CHECK-NEXT: vins.f16 s8, s21
				; CHECK-NEXT: vins.f16 s9, s23
				; CHECK-NEXT: vldrw.u32 q5, [r1]
				; CHECK-NEXT: vmovx.f16 s15, s18
				; CHECK-NEXT: vins.f16 s14, s6
				; CHECK-NEXT: vmovx.f16 s6, s19
				; CHECK-NEXT: vmovx.f16 s7, s22
				; CHECK-NEXT: vins.f16 s15, s6
				; CHECK-NEXT: vmovx.f16 s6, s20
				; CHECK-NEXT: vmovx.f16 s10, s21
				; CHECK-NEXT: vins.f16 s2, s3
				; CHECK-NEXT: vins.f16 s22, s23
				; CHECK-NEXT: vins.f16 s20, s21
				; CHECK-NEXT: vins.f16 s0, s1
				; CHECK-NEXT: vins.f16 s18, s19
				; CHECK-NEXT: vins.f16 s6, s10
				; CHECK-NEXT: vmovx.f16 s10, s23
				; CHECK-NEXT: vmov.f32 s1, s2
				; CHECK-NEXT: vins.f16 s16, s17
				; CHECK-NEXT: vmov.f32 s2, s20
				; CHECK-NEXT: vins.f16 s7, s10
				; CHECK-NEXT: vmov.f32 s3, s22
				; CHECK-NEXT: vmov.f32 s10, s16
				; CHECK-NEXT: vmov.f32 s11, s18
				; CHECK-NEXT: vmul.f16 q4, q3, q0
				; CHECK-NEXT: vfma.f16 q4, q2, q1
				; CHECK-NEXT: vmul.f16 q1, q1, q3
				; CHECK-NEXT: vneg.f16 q1, q1
				; CHECK-NEXT: vfma.f16 q1, q2, q0
				; CHECK-NEXT: vmovx.f16 s0, s16
				; CHECK-NEXT: vmovx.f16 s8, s4
				; CHECK-NEXT: vmovx.f16 s10, s5
				; CHECK-NEXT: vins.f16 s8, s0
				; CHECK-NEXT: vmovx.f16 s0, s17
				; CHECK-NEXT: vins.f16 s10, s0
				; CHECK-NEXT: vmovx.f16 s1, s6
				; CHECK-NEXT: vmovx.f16 s0, s18
				; CHECK-NEXT: vmovx.f16 s3, s7
				; CHECK-NEXT: vins.f16 s1, s0
				; CHECK-NEXT: vins.f16 s7, s19
				; CHECK-NEXT: vmovx.f16 s0, s19
				; CHECK-NEXT: vins.f16 s6, s18
				; CHECK-NEXT: vins.f16 s3, s0
				; CHECK-NEXT: vins.f16 s4, s16
				; CHECK-NEXT: vins.f16 s5, s17
				; CHECK-NEXT: vmov.f32 s0, s6
				; CHECK-NEXT: vmov.f32 s2, s7
				; CHECK-NEXT: vstrw.32 q0, [r0, #16]
				; CHECK-NEXT: vmov q0, q1
				; CHECK-NEXT: vmov.f32 s1, s8
				; CHECK-NEXT: vmov.f32 s2, s5
				; CHECK-NEXT: vmov.f32 s3, s10
				; CHECK-NEXT: vstrw.32 q0, [r0]
				; CHECK-NEXT: vpop {d8, d9, d10, d11}
				; CHECK-NEXT: bx lr
				entry:
				%a.real = shufflevector <16 x half> %a, <16 x half> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%a.imag = shufflevector <16 x half> %a, <16 x half> poison, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%b.real = shufflevector <16 x half> %b, <16 x half> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%b.imag = shufflevector <16 x half> %b, <16 x half> poison, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%0 = fmul fast <8 x half> %b.imag, %a.real
				%1 = fmul fast <8 x half> %b.real, %a.imag
				%2 = fadd fast <8 x half> %1, %0
				%3 = fmul fast <8 x half> %b.real, %a.real
				%4 = fmul fast <8 x half> %a.imag, %b.imag
				%5 = fsub fast <8 x half> %3, %4
				%interleaved.vec = shufflevector <8 x half> %5, <8 x half> %2, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
				ret <16 x half> %interleaved.vec
				}

				define <32 x half> @complex_mul_v32f16(<32 x half> %a, <32 x half> %b) #0 {
				; CHECK-LABEL: complex_mul_v32f16:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .vsave {d8, d9, d10, d11, d12, d13, d14, d15}
				; CHECK-NEXT: vpush {d8, d9, d10, d11, d12, d13, d14, d15}
				; CHECK-NEXT: .pad #40
				; CHECK-NEXT: sub sp, #40
				; CHECK-NEXT: vmov d2, r2, r3
				; CHECK-NEXT: vldr d3, [sp, #104]
				; CHECK-NEXT: add r1, sp, #112
				; CHECK-NEXT: vmovx.f16 s0, s4
				; CHECK-NEXT: vldrw.u32 q2, [r1]
				; CHECK-NEXT: vmovx.f16 s2, s5
				; CHECK-NEXT: vins.f16 s0, s2
				; CHECK-NEXT: vmovx.f16 s1, s6
				; CHECK-NEXT: vmovx.f16 s2, s7
				; CHECK-NEXT: vmovx.f16 s12, s9
				; CHECK-NEXT: vins.f16 s1, s2
				; CHECK-NEXT: vmovx.f16 s2, s8
				; CHECK-NEXT: vins.f16 s2, s12
				; CHECK-NEXT: vmovx.f16 s3, s10
				; CHECK-NEXT: vmovx.f16 s12, s11
				; CHECK-NEXT: add r1, sp, #160
				; CHECK-NEXT: vins.f16 s3, s12
				; CHECK-NEXT: vldrw.u32 q3, [r1]
				; CHECK-NEXT: add r1, sp, #176
				; CHECK-NEXT: vins.f16 s4, s5
				; CHECK-NEXT: vldrw.u32 q5, [r1]
				; CHECK-NEXT: vmovx.f16 s16, s12
				; CHECK-NEXT: vmovx.f16 s5, s13
				; CHECK-NEXT: vmovx.f16 s17, s14
				; CHECK-NEXT: vins.f16 s16, s5
				; CHECK-NEXT: vmovx.f16 s5, s15
				; CHECK-NEXT: vins.f16 s17, s5
				; CHECK-NEXT: vmovx.f16 s18, s20
				; CHECK-NEXT: vmovx.f16 s5, s21
				; CHECK-NEXT: vmovx.f16 s19, s22
				; CHECK-NEXT: vins.f16 s18, s5
				; CHECK-NEXT: vmovx.f16 s5, s23
				; CHECK-NEXT: vins.f16 s6, s7
				; CHECK-NEXT: vins.f16 s10, s11
				; CHECK-NEXT: vins.f16 s8, s9
				; CHECK-NEXT: vins.f16 s19, s5
				; CHECK-NEXT: vmov.f32 s5, s6
				; CHECK-NEXT: vins.f16 s14, s15
				; CHECK-NEXT: vins.f16 s22, s23
				; CHECK-NEXT: vins.f16 s20, s21
				; CHECK-NEXT: vmov.f32 s6, s8
				; CHECK-NEXT: vins.f16 s12, s13
				; CHECK-NEXT: vmov.f32 s7, s10
				; CHECK-NEXT: add r1, sp, #144
				; CHECK-NEXT: vmov.f32 s13, s14
				; CHECK-NEXT: vldrw.u32 q7, [r1]
				; CHECK-NEXT: vmov.f32 s14, s20
				; CHECK-NEXT: add r1, sp, #192
				; CHECK-NEXT: vmov.f32 s15, s22
				; CHECK-NEXT: vmul.f16 q5, q4, q1
				; CHECK-NEXT: vfma.f16 q5, q3, q0
				; CHECK-NEXT: vmul.f16 q0, q0, q4
				; CHECK-NEXT: vneg.f16 q2, q0
				; CHECK-NEXT: vmovx.f16 s0, s20
				; CHECK-NEXT: vfma.f16 q2, q3, q1
				; CHECK-NEXT: vldrw.u32 q1, [r1]
				; CHECK-NEXT: vmovx.f16 s2, s8
				; CHECK-NEXT: vmovx.f16 s1, s10
				; CHECK-NEXT: vins.f16 s2, s0
				; CHECK-NEXT: vmovx.f16 s0, s21
				; CHECK-NEXT: vstr s2, [sp, #36] @ 4-byte Spill
				; CHECK-NEXT: vmovx.f16 s2, s9
				; CHECK-NEXT: vins.f16 s2, s0
				; CHECK-NEXT: vmovx.f16 s0, s22
				; CHECK-NEXT: vins.f16 s1, s0
				; CHECK-NEXT: vmovx.f16 s3, s11
				; CHECK-NEXT: vmovx.f16 s0, s23
				; CHECK-NEXT: add r1, sp, #208
				; CHECK-NEXT: vins.f16 s3, s0
				; CHECK-NEXT: vmovx.f16 s14, s28
				; CHECK-NEXT: vstrw.32 q0, [sp] @ 16-byte Spill
				; CHECK-NEXT: vmovx.f16 s0, s29
				; CHECK-NEXT: vins.f16 s8, s20
				; CHECK-NEXT: vins.f16 s9, s21
				; CHECK-NEXT: vins.f16 s10, s22
				; CHECK-NEXT: vins.f16 s11, s23
				; CHECK-NEXT: vldrw.u32 q5, [r1]
				; CHECK-NEXT: vins.f16 s14, s0
				; CHECK-NEXT: vmovx.f16 s15, s30
				; CHECK-NEXT: vmovx.f16 s0, s31
				; CHECK-NEXT: vins.f16 s15, s0
				; CHECK-NEXT: vmovx.f16 s27, s22
				; CHECK-NEXT: vmovx.f16 s0, s23
				; CHECK-NEXT: add r1, sp, #128
				; CHECK-NEXT: vstr s2, [sp, #32] @ 4-byte Spill
				; CHECK-NEXT: vins.f16 s27, s0
				; CHECK-NEXT: vldrw.u32 q0, [r1]
				; CHECK-NEXT: vstrw.32 q2, [sp, #16] @ 16-byte Spill
				; CHECK-NEXT: vmovx.f16 s24, s4
				; CHECK-NEXT: vmovx.f16 s25, s6
				; CHECK-NEXT: vmovx.f16 s12, s0
				; CHECK-NEXT: vmovx.f16 s8, s1
				; CHECK-NEXT: vins.f16 s0, s1
				; CHECK-NEXT: vmov.f32 s1, s2
				; CHECK-NEXT: vins.f16 s12, s8
				; CHECK-NEXT: vmovx.f16 s8, s5
				; CHECK-NEXT: vins.f16 s1, s3
				; CHECK-NEXT: vins.f16 s24, s8
				; CHECK-NEXT: vmov q2, q0
				; CHECK-NEXT: vmovx.f16 s0, s7
				; CHECK-NEXT: vins.f16 s25, s0
				; CHECK-NEXT: vmovx.f16 s0, s21
				; CHECK-NEXT: vmovx.f16 s26, s20
				; CHECK-NEXT: vins.f16 s30, s31
				; CHECK-NEXT: vins.f16 s28, s29
				; CHECK-NEXT: vins.f16 s26, s0
				; CHECK-NEXT: vmovx.f16 s0, s3
				; CHECK-NEXT: vmovx.f16 s13, s2
				; CHECK-NEXT: vmov.f32 s16, s4
				; CHECK-NEXT: vins.f16 s13, s0
				; CHECK-NEXT: vmov.f32 s17, s6
				; CHECK-NEXT: vins.f16 s22, s23
				; CHECK-NEXT: vmov.f32 s18, s20
				; CHECK-NEXT: vins.f16 s16, s5
				; CHECK-NEXT: vmov.f32 s10, s28
				; CHECK-NEXT: vins.f16 s17, s7
				; CHECK-NEXT: vmov.f32 s11, s30
				; CHECK-NEXT: vins.f16 s18, s21
				; CHECK-NEXT: vmul.f16 q1, q6, q2
				; CHECK-NEXT: vmov.f32 s19, s22
				; CHECK-NEXT: vmul.f16 q0, q3, q6
				; CHECK-NEXT: vfma.f16 q1, q4, q3
				; CHECK-NEXT: vneg.f16 q3, q0
				; CHECK-NEXT: vmovx.f16 s0, s6
				; CHECK-NEXT: vfma.f16 q3, q4, q2
				; CHECK-NEXT: vmovx.f16 s1, s14
				; CHECK-NEXT: vmovx.f16 s3, s15
				; CHECK-NEXT: vins.f16 s1, s0
				; CHECK-NEXT: vmovx.f16 s0, s7
				; CHECK-NEXT: vins.f16 s14, s6
				; CHECK-NEXT: vins.f16 s3, s0
				; CHECK-NEXT: vmovx.f16 s6, s12
				; CHECK-NEXT: vmovx.f16 s0, s4
				; CHECK-NEXT: vins.f16 s15, s7
				; CHECK-NEXT: vins.f16 s12, s4
				; CHECK-NEXT: vins.f16 s6, s0
				; CHECK-NEXT: vmovx.f16 s4, s13
				; CHECK-NEXT: vins.f16 s13, s5
				; CHECK-NEXT: vmovx.f16 s0, s5
				; CHECK-NEXT: vmov q2, q3
				; CHECK-NEXT: vins.f16 s4, s0
				; CHECK-NEXT: vmov.f32 s0, s14
				; CHECK-NEXT: vmov.f32 s2, s15
				; CHECK-NEXT: vmov.f32 s9, s6
				; CHECK-NEXT: vstrw.32 q0, [r0, #48]
				; CHECK-NEXT: vmov.f32 s10, s13
				; CHECK-NEXT: vmov.f32 s11, s4
				; CHECK-NEXT: vstrw.32 q2, [r0, #32]
				; CHECK-NEXT: vldrw.u32 q1, [sp, #16] @ 16-byte Reload
				; CHECK-NEXT: vldrw.u32 q0, [sp] @ 16-byte Reload
				; CHECK-NEXT: vmov.f32 s0, s6
				; CHECK-NEXT: vmov.f32 s2, s7
				; CHECK-NEXT: vstrw.32 q0, [r0, #16]
				; CHECK-NEXT: vmov q0, q1
				; CHECK-NEXT: vldr s1, [sp, #36] @ 4-byte Reload
				; CHECK-NEXT: vmov.f32 s2, s5
				; CHECK-NEXT: vldr s3, [sp, #32] @ 4-byte Reload
				; CHECK-NEXT: vstrw.32 q0, [r0]
				; CHECK-NEXT: add sp, #40
				; CHECK-NEXT: vpop {d8, d9, d10, d11, d12, d13, d14, d15}
				; CHECK-NEXT: bx lr
				entry:
				%a.real = shufflevector <32 x half> %a, <32 x half> poison, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30>
				%a.imag = shufflevector <32 x half> %a, <32 x half> poison, <16 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15, i32 17, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31>
				%b.real = shufflevector <32 x half> %b, <32 x half> poison, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30>
				%b.imag = shufflevector <32 x half> %b, <32 x half> poison, <16 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15, i32 17, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31>
				%0 = fmul fast <16 x half> %b.imag, %a.real
				%1 = fmul fast <16 x half> %b.real, %a.imag
				%2 = fadd fast <16 x half> %1, %0
				%3 = fmul fast <16 x half> %b.real, %a.real
				%4 = fmul fast <16 x half> %a.imag, %b.imag
				%5 = fsub fast <16 x half> %3, %4
				%interleaved.vec = shufflevector <16 x half> %5, <16 x half> %2, <32 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23, i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
				ret <32 x half> %interleaved.vec
				}
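
The `complex_mul_*` tests above have the same shape with a full complex product: the real lanes are `b.real * a.real - a.imag * b.imag` and the imaginary lanes are `b.real * a.imag + b.imag * a.real`. In the v8f16 case this is matched and lowered to a `vcmul.f16 ..., #90` / `vcmla.f16 ..., #0` pair. The sketch below models one complex element in Python. It assumes the usual Arm partial-product reading of the two rotations (rotation 0 contributes the terms that use the real part of the first operand, rotation 90 the terms that use its imaginary part); that reading comes from the architecture description rather than from this review, and all function names are hypothetical.

```python
def partial_rot0(n, m):
    """Assumed #0 partial product: terms using the real part of n."""
    n_re, _ = n
    m_re, m_im = m
    return (n_re * m_re, n_re * m_im)


def partial_rot90(n, m):
    """Assumed #90 partial product: terms using the imaginary part of n."""
    _, n_im = n
    m_re, m_im = m
    return (-n_im * m_im, n_im * m_re)


def complex_mul_ir(a, b):
    """Scalar model of the fmul/fadd/fsub pattern in the test IR."""
    a_re, a_im = a
    b_re, b_im = b
    real = b_re * a_re - a_im * b_im
    imag = b_re * a_im + b_im * a_re
    return (real, imag)


def complex_mul_two_step(a, b):
    """Model of a multiply (#90) followed by a multiply-accumulate (#0)."""
    acc = partial_rot90(a, b)   # stands in for vcmul ..., #90
    step = partial_rot0(a, b)   # stands in for vcmla ..., #0 (added to acc)
    return (acc[0] + step[0], acc[1] + step[1])


a = (3.0, 4.0)  # 3 + 4i
b = (5.0, 6.0)  # 5 + 6i
from_ir = complex_mul_ir(a, b)
from_two_step = complex_mul_two_step(a, b)
```

Under those assumptions both models agree with the ordinary complex product of `a` and `b`, which is why two rotated instructions are enough to cover the four multiplies in the IR.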

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s --mattr=+mve.fp -o - | FileCheck %s

				target triple = "thumbv8.1m.main-none-none-eabi"



				define <2 x float> @complex_add_v2f32(<2 x float> %a, <2 x float> %b) #0 {
				; CHECK-LABEL: complex_add_v2f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: mov r2, sp
				; CHECK-NEXT: vmov d0, r0, r1
				; CHECK-NEXT: vldrw.u32 q1, [r2]
				; CHECK-NEXT: vmov r2, r3, d0
				; CHECK-NEXT: vadd.f32 s5, s5, s0
				; CHECK-NEXT: vsub.f32 s4, s4, s1
				; CHECK-NEXT: vmov r0, r1, d2
				; CHECK-NEXT: bx lr
				entry:
				%a.real = shufflevector <2 x float> %a, <2 x float> zeroinitializer, <1 x i32> <i32 0>
				%a.imag = shufflevector <2 x float> %a, <2 x float> zeroinitializer, <1 x i32> <i32 1>
				%b.real = shufflevector <2 x float> %b, <2 x float> zeroinitializer, <1 x i32> <i32 0>
				%b.imag = shufflevector <2 x float> %b, <2 x float> zeroinitializer, <1 x i32> <i32 1>
				%0 = fsub fast <1 x float> %b.real, %a.imag
				%1 = fadd fast <1 x float> %b.imag, %a.real
				%interleaved.vec = shufflevector <1 x float> %0, <1 x float> %1, <2 x i32> <i32 0, i32 1>
				ret <2 x float> %interleaved.vec
				}
				define <4 x float> @complex_add_v4f32(<4 x float> %a, <4 x float> %b) #0 {
				; CHECK-LABEL: complex_add_v4f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vmov d0, r0, r1
				; CHECK-NEXT: mov r0, sp
				; CHECK-NEXT: vldrw.u32 q1, [r0]
				; CHECK-NEXT: vmov d1, r2, r3
				; CHECK-NEXT: vcadd.f32 q2, q1, q0, #90
				; CHECK-NEXT: vmov r0, r1, d4
				; CHECK-NEXT: vmov r2, r3, d5
				; CHECK-NEXT: bx lr
				entry:
				%a.real = shufflevector <4 x float> %a, <4 x float> zeroinitializer, <2 x i32> <i32 0, i32 2>
				%a.imag = shufflevector <4 x float> %a, <4 x float> zeroinitializer, <2 x i32> <i32 1, i32 3>
				%b.real = shufflevector <4 x float> %b, <4 x float> zeroinitializer, <2 x i32> <i32 0, i32 2>
				%b.imag = shufflevector <4 x float> %b, <4 x float> zeroinitializer, <2 x i32> <i32 1, i32 3>
				%0 = fsub fast <2 x float> %b.real, %a.imag
				%1 = fadd fast <2 x float> %b.imag, %a.real
				%interleaved.vec = shufflevector <2 x float> %0, <2 x float> %1, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
				ret <4 x float> %interleaved.vec
				}
				define <8 x float> @complex_add_v8f32(<8 x float> %a, <8 x float> %b) #0 {
				; CHECK-LABEL: complex_add_v8f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: .vsave {d8, d9}
				; CHECK-NEXT: vpush {d8, d9}
				; CHECK-NEXT: add.w lr, sp, #48
				; CHECK-NEXT: add r1, sp, #64
				; CHECK-NEXT: vldrw.u32 q3, [r1]
				; CHECK-NEXT: vldrw.u32 q1, [lr]
				; CHECK-NEXT: add.w r12, sp, #32
				; CHECK-NEXT: vldr d1, [sp, #24]
				; CHECK-NEXT: vmov d0, r2, r3
				; CHECK-NEXT: vmov.f32 s8, s5
				; CHECK-NEXT: vmov.f32 s9, s7
				; CHECK-NEXT: vmov.f32 s5, s6
				; CHECK-NEXT: vmov.f32 s10, s13
				; CHECK-NEXT: vmov.f32 s11, s15
				; CHECK-NEXT: vmov.f32 s6, s12
				; CHECK-NEXT: vmov.f32 s7, s14
				; CHECK-NEXT: vldrw.u32 q3, [r12]
				; CHECK-NEXT: vmov.f32 s16, s0
				; CHECK-NEXT: vmov.f32 s17, s2
				; CHECK-NEXT: vmov.f32 s0, s1
				; CHECK-NEXT: vmov.f32 s1, s3
				; CHECK-NEXT: vmov.f32 s18, s12
				; CHECK-NEXT: vmov.f32 s19, s14
				; CHECK-NEXT: vmov.f32 s2, s13
				; CHECK-NEXT: vadd.f32 q2, q2, q4
				; CHECK-NEXT: vmov.f32 s3, s15
				; CHECK-NEXT: vsub.f32 q0, q1, q0
				; CHECK-NEXT: vmov.f32 s5, s10
				; CHECK-NEXT: vmov.f32 s4, s2
				; CHECK-NEXT: vmov.f32 s6, s3
				; CHECK-NEXT: vmov.f32 s7, s11
				; CHECK-NEXT: vstrw.32 q1, [r0, #16]
				; CHECK-NEXT: vmov.f32 s4, s0
				; CHECK-NEXT: vmov.f32 s5, s8
				; CHECK-NEXT: vmov.f32 s6, s1
				; CHECK-NEXT: vmov.f32 s7, s9
				; CHECK-NEXT: vstrw.32 q1, [r0]
				; CHECK-NEXT: vpop {d8, d9}
				; CHECK-NEXT: pop {r7, pc}
				entry:
				%a.real = shufflevector <8 x float> %a, <8 x float> zeroinitializer, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				%a.imag = shufflevector <8 x float> %a, <8 x float> zeroinitializer, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				%b.real = shufflevector <8 x float> %b, <8 x float> zeroinitializer, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				%b.imag = shufflevector <8 x float> %b, <8 x float> zeroinitializer, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				%0 = fsub fast <4 x float> %b.real, %a.imag
				chill: Shouldn't these be translated to a couple of `vcadd.f32` instructions, like in the previous test? And this amount of move instructions seems excessive.
				chill:
				> And this amount of move instructions seems excessive.
				I guess MVE does not have sensible swizzling instructions.
				dmgreen: Yeah, this is expected from shuffles that MVE cannot handle very well. It would look a lot better either with loads that could become interleaving loads, or under AArch64, where better shuffles are available.
				NickGuy (author):
				> Shouldn't these be translated to a couple of `vcadd.f32` instructions, like in the previous test?
				Not in this case, no. The IR vectors are too wide to fit into actual vector registers, and the follow-up vector splitting patch (D129067) is restricted to splitting in half.
				%1 = fadd fast <4 x float> %b.imag, %a.real
				%interleaved.vec = shufflevector <4 x float> %0, <4 x float> %1, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
				ret <8 x float> %interleaved.vec
				}
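
The exchange on `complex_add_v8f32` turns on the fact that the operation is purely per-lane, so a wide vector can in principle be handled as independent register-sized pieces. The following is a minimal Python sketch of what the IR in these `complex_add_*` tests computes, not the pass's implementation. The function names and the `width=4` chunk size (one 128-bit register holding four `float` lanes, i.e. two complex values) are assumptions made for illustration.

```python
def deinterleave(v):
    """Split [re0, im0, re1, im1, ...] into (reals, imags), like the input shuffles."""
    return v[0::2], v[1::2]


def interleave(re, im):
    """Inverse of deinterleave, like the final %interleaved.vec shuffle."""
    out = []
    for r, i in zip(re, im):
        out += [r, i]
    return out


def complex_add_rot90(a, b):
    """Model of the test IR: computes b + i*a on interleaved complex data."""
    a_re, a_im = deinterleave(a)
    b_re, b_im = deinterleave(b)
    re = [br - ai for br, ai in zip(b_re, a_im)]  # %0 = fsub %b.real, %a.imag
    im = [bi + ar for bi, ar in zip(b_im, a_re)]  # %1 = fadd %b.imag, %a.real
    return interleave(re, im)


def complex_add_rot90_split(a, b, width=4):
    """Hypothetical splitting: process `width` lanes at a time (assumed register size)."""
    out = []
    for start in range(0, len(a), width):
        out += complex_add_rot90(a[start:start + width], b[start:start + width])
    return out


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]          # models <8 x float> %a
b = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]  # models <8 x float> %b
whole = complex_add_rot90(a, b)
pieces = complex_add_rot90_split(a, b)
```

Because no lane depends on any other complex element, `whole` and `pieces` are identical here. That is why a splitting step can recover the `vcadd.f32` form for the wider tests once the vectors are cut down to a size the target supports.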
				define <16 x float> @complex_add_v16f32(<16 x float> %a, <16 x float> %b) #0 {
				; CHECK-LABEL: complex_add_v16f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r4, r5, r6, lr}
				; CHECK-NEXT: push {r4, r5, r6, lr}
				; CHECK-NEXT: .vsave {d8, d9, d10, d11, d12, d13}
				; CHECK-NEXT: vpush {d8, d9, d10, d11, d12, d13}
				; CHECK-NEXT: add r4, sp, #152
				; CHECK-NEXT: add r5, sp, #88
				; CHECK-NEXT: add r6, sp, #168
				; CHECK-NEXT: vmov d0, r2, r3
				; CHECK-NEXT: add r2, sp, #104
				; CHECK-NEXT: vldrw.u32 q5, [r5]
				; CHECK-NEXT: vldrw.u32 q2, [r2]
				; CHECK-NEXT: vldrw.u32 q3, [r6]
				; CHECK-NEXT: vldrw.u32 q4, [r4]
				; CHECK-NEXT: vmov.f32 s4, s20
				; CHECK-NEXT: vmov.f32 s5, s22
				; CHECK-NEXT: add.w lr, sp, #120
				; CHECK-NEXT: vmov.f32 s24, s17
				; CHECK-NEXT: add r1, sp, #136
				; CHECK-NEXT: vmov.f32 s25, s19
				; CHECK-NEXT: add.w r12, sp, #72
				; CHECK-NEXT: vmov.f32 s6, s8
				; CHECK-NEXT: vldr d1, [sp, #64]
				; CHECK-NEXT: vmov.f32 s7, s10
				; CHECK-NEXT: vmov.f32 s26, s13
				; CHECK-NEXT: vmov.f32 s27, s15
				; CHECK-NEXT: vadd.f32 q1, q6, q1
				; CHECK-NEXT: vmov.f32 s20, s21
				; CHECK-NEXT: vmov.f32 s21, s23
				; CHECK-NEXT: vmov.f32 s22, s9
				; CHECK-NEXT: vmov.f32 s23, s11
				; CHECK-NEXT: vmov.f32 s17, s18
				; CHECK-NEXT: vmov.f32 s18, s12
				; CHECK-NEXT: vmov.f32 s19, s14
				; CHECK-NEXT: vsub.f32 q4, q4, q5
				; CHECK-NEXT: vmov.f32 s9, s6
				; CHECK-NEXT: vmov.f32 s11, s7
				; CHECK-NEXT: vldrw.u32 q5, [r1]
				; CHECK-NEXT: vmov.f32 s13, s4
				; CHECK-NEXT: vmov.f32 s15, s5
				; CHECK-NEXT: vldrw.u32 q1, [lr]
				; CHECK-NEXT: vmov.f32 s8, s18
				; CHECK-NEXT: vmov.f32 s10, s19
				; CHECK-NEXT: vmov.f32 s12, s16
				; CHECK-NEXT: vstrw.32 q2, [r0, #48]
				; CHECK-NEXT: vmov.f32 s14, s17
				; CHECK-NEXT: vmov.f32 s16, s5
				; CHECK-NEXT: vstrw.32 q3, [r0, #32]
				; CHECK-NEXT: vmov.f32 s17, s7
				; CHECK-NEXT: vmov.f32 s5, s6
				; CHECK-NEXT: vmov.f32 s18, s21
				; CHECK-NEXT: vmov.f32 s19, s23
				; CHECK-NEXT: vmov.f32 s6, s20
				; CHECK-NEXT: vmov.f32 s7, s22
				; CHECK-NEXT: vldrw.u32 q5, [r12]
				; CHECK-NEXT: vmov.f32 s8, s0
				; CHECK-NEXT: vmov.f32 s9, s2
				; CHECK-NEXT: vmov.f32 s0, s1
				; CHECK-NEXT: vmov.f32 s1, s3
				; CHECK-NEXT: vmov.f32 s10, s20
				; CHECK-NEXT: vmov.f32 s11, s22
				; CHECK-NEXT: vmov.f32 s2, s21
				; CHECK-NEXT: vadd.f32 q2, q4, q2
				; CHECK-NEXT: vmov.f32 s3, s23
				; CHECK-NEXT: vsub.f32 q0, q1, q0
				; CHECK-NEXT: vmov.f32 s5, s10
				; CHECK-NEXT: vmov.f32 s4, s2
				; CHECK-NEXT: vmov.f32 s6, s3
				; CHECK-NEXT: vmov.f32 s7, s11
				; CHECK-NEXT: vstrw.32 q1, [r0, #16]
				; CHECK-NEXT: vmov.f32 s4, s0
				; CHECK-NEXT: vmov.f32 s5, s8
				; CHECK-NEXT: vmov.f32 s6, s1
				; CHECK-NEXT: vmov.f32 s7, s9
				; CHECK-NEXT: vstrw.32 q1, [r0]
				; CHECK-NEXT: vpop {d8, d9, d10, d11, d12, d13}
				; CHECK-NEXT: pop {r4, r5, r6, pc}
				entry:
				%a.real = shufflevector <16 x float> %a, <16 x float> zeroinitializer, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%a.imag = shufflevector <16 x float> %a, <16 x float> zeroinitializer, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%b.real = shufflevector <16 x float> %b, <16 x float> zeroinitializer, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%b.imag = shufflevector <16 x float> %b, <16 x float> zeroinitializer, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%0 = fsub fast <8 x float> %b.real, %a.imag
				%1 = fadd fast <8 x float> %b.imag, %a.real
				%interleaved.vec = shufflevector <8 x float> %0, <8 x float> %1, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
				ret <16 x float> %interleaved.vec
				}

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-mul.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s --mattr=+mve.fp -o - | FileCheck %s

				target triple = "thumbv8.1m.main-none-none-eabi"

				define <2 x float> @complex_mul_v2f32(<2 x float> %a, <2 x float> %b) #0 {
				; CHECK-LABEL: complex_mul_v2f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: mov r2, sp
				; CHECK-NEXT: vmov d0, r0, r1
				; CHECK-NEXT: vldrw.u32 q1, [r2]
				; CHECK-NEXT: vmov r2, r3, d0
				; CHECK-NEXT: vmul.f32 s9, s5, s0
				; CHECK-NEXT: vmul.f32 s8, s1, s5
				; CHECK-NEXT: vfma.f32 s9, s4, s1
				; CHECK-NEXT: vfnms.f32 s8, s4, s0
				; CHECK-NEXT: vmov r0, r1, d4
				; CHECK-NEXT: bx lr
				entry:
				%a.real = shufflevector <2 x float> %a, <2 x float> poison, <1 x i32> <i32 0>
				%a.imag = shufflevector <2 x float> %a, <2 x float> poison, <1 x i32> <i32 1>
				%b.real = shufflevector <2 x float> %b, <2 x float> poison, <1 x i32> <i32 0>
				%b.imag = shufflevector <2 x float> %b, <2 x float> poison, <1 x i32> <i32 1>
				%0 = fmul fast <1 x float> %b.imag, %a.real
				%1 = fmul fast <1 x float> %b.real, %a.imag
				%2 = fadd fast <1 x float> %1, %0
				%3 = fmul fast <1 x float> %b.real, %a.real
				%4 = fmul fast <1 x float> %a.imag, %b.imag
				%5 = fsub fast <1 x float> %3, %4
				%interleaved.vec = shufflevector <1 x float> %5, <1 x float> %2, <2 x i32> <i32 0, i32 1>
				ret <2 x float> %interleaved.vec
				}
				dmgreen: Add `arm_aapcs_vfpcc` to any tests that take or return vectors. I think you can remove `#0` too.

				define <4 x float> @complex_mul_v4f32(<4 x float> %a, <4 x float> %b) #0 {
				; CHECK-LABEL: complex_mul_v4f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vmov d0, r0, r1
				; CHECK-NEXT: mov r0, sp
				; CHECK-NEXT: vldrw.u32 q1, [r0]
				; CHECK-NEXT: vmov d1, r2, r3
				; CHECK-NEXT: vcmul.f32 q2, q0, q1, #90
				; CHECK-NEXT: vcmla.f32 q2, q0, q1, #0
				; CHECK-NEXT: vmov r0, r1, d4
				; CHECK-NEXT: vmov r2, r3, d5
				; CHECK-NEXT: bx lr
				entry:
				%a.real = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2>
				%a.imag = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3>
				%b.real = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2>
				%b.imag = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3>
				%0 = fmul fast <2 x float> %b.imag, %a.real
				%1 = fmul fast <2 x float> %b.real, %a.imag
				%2 = fadd fast <2 x float> %1, %0
				%3 = fmul fast <2 x float> %b.real, %a.real
				%4 = fmul fast <2 x float> %a.imag, %b.imag
				%5 = fsub fast <2 x float> %3, %4
				%interleaved.vec = shufflevector <2 x float> %5, <2 x float> %2, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
				ret <4 x float> %interleaved.vec
				}
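
For readers following the `complex_mul_*` tests, here is a small Python model of the arithmetic the IR spells out, checked against Python's built-in `complex` type. It is a sketch only. The function names are invented for illustration, and the decomposition into a "rotation 0" and a "rotation 90" partial product is an assumption about how the `#0` and `#90` operands of `vcmul.f32`/`vcmla.f32` in the CHECK lines are usually read, not something the review states.

```python
def complex_mul_model(a, b):
    """Model of the test IR on interleaved data [re0, im0, re1, im1, ...]."""
    a_re, a_im = a[0::2], a[1::2]
    b_re, b_im = b[0::2], b[1::2]
    # %2 = (%b.real * %a.imag) + (%b.imag * %a.real)  -> imaginary part
    im = [br * ai + bi * ar for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    # %5 = (%b.real * %a.real) - (%a.imag * %b.imag)  -> real part
    re = [br * ar - ai * bi for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    out = []
    for r, i in zip(re, im):
        out += [r, i]
    return out


def partial_product(a, b, rotation):
    """Assumed rotation semantics: each rotation yields half of the full product."""
    out = []
    for k in range(0, len(a), 2):
        ar, ai, br, bi = a[k], a[k + 1], b[k], b[k + 1]
        if rotation == 0:
            out += [ar * br, ar * bi]    # uses only the real part of a
        elif rotation == 90:
            out += [-ai * bi, ai * br]   # uses only the imaginary part of a
        else:
            raise ValueError("only rotations 0 and 90 are modelled")
    return out


a = [1.0, 2.0, 3.0, 4.0]  # models <4 x float> %a: (1+2i), (3+4i)
b = [5.0, 6.0, 7.0, 8.0]  # models <4 x float> %b: (5+6i), (7+8i)
product = complex_mul_model(a, b)
summed = [x + y for x, y in zip(partial_product(a, b, 90), partial_product(a, b, 0))]
```

Under that reading, accumulating the rotation-0 partial onto the rotation-90 partial gives the same values as the six-instruction IR sequence. This matches the two-instruction `vcmul`/`vcmla` pair expected for `complex_mul_v4f32`.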

				define <8 x float> @complex_mul_v8f32(<8 x float> %a, <8 x float> %b) #0 {
				; CHECK-LABEL: complex_mul_v8f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: .vsave {d8, d9, d10, d11}
				; CHECK-NEXT: vpush {d8, d9, d10, d11}
				; CHECK-NEXT: add.w lr, sp, #64
				; CHECK-NEXT: add.w r12, sp, #48
				; CHECK-NEXT: add r1, sp, #80
				; CHECK-NEXT: vldrw.u32 q3, [lr]
				; CHECK-NEXT: vldrw.u32 q2, [r1]
				; CHECK-NEXT: vldrw.u32 q4, [r12]
				; CHECK-NEXT: vldr d1, [sp, #40]
				; CHECK-NEXT: vmov d0, r2, r3
				; CHECK-NEXT: vmov.f32 s4, s12
				; CHECK-NEXT: vmov.f32 s5, s14
				; CHECK-NEXT: vmov.f32 s12, s13
				; CHECK-NEXT: vmov.f32 s13, s15
				; CHECK-NEXT: vmov.f32 s20, s0
				; CHECK-NEXT: vmov.f32 s21, s2
				; CHECK-NEXT: vmov.f32 s14, s9
				; CHECK-NEXT: vmov.f32 s15, s11
				; CHECK-NEXT: vmov.f32 s22, s16
				; CHECK-NEXT: vmov.f32 s23, s18
				; CHECK-NEXT: vmov.f32 s0, s1
				; CHECK-NEXT: vmov.f32 s1, s3
				; CHECK-NEXT: vmov.f32 s6, s8
				; CHECK-NEXT: vmov.f32 s7, s10
				; CHECK-NEXT: vmul.f32 q2, q3, q5
				; CHECK-NEXT: vmov.f32 s2, s17
				; CHECK-NEXT: vmov.f32 s3, s19
				; CHECK-NEXT: vfma.f32 q2, q1, q0
				; CHECK-NEXT: vmul.f32 q0, q0, q3
				; CHECK-NEXT: vneg.f32 q0, q0
				; CHECK-NEXT: vfma.f32 q0, q1, q5
				; CHECK-NEXT: vmov.f32 s5, s10
				; CHECK-NEXT: vmov.f32 s4, s2
				; CHECK-NEXT: vmov.f32 s6, s3
				; CHECK-NEXT: vmov.f32 s7, s11
				; CHECK-NEXT: vstrw.32 q1, [r0, #16]
				; CHECK-NEXT: vmov.f32 s4, s0
				; CHECK-NEXT: vmov.f32 s5, s8
				; CHECK-NEXT: vmov.f32 s6, s1
				; CHECK-NEXT: vmov.f32 s7, s9
				; CHECK-NEXT: vstrw.32 q1, [r0]
				; CHECK-NEXT: vpop {d8, d9, d10, d11}
				; CHECK-NEXT: pop {r7, pc}
				entry:
				%a.real = shufflevector <8 x float> %a, <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				%a.imag = shufflevector <8 x float> %a, <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				%b.real = shufflevector <8 x float> %b, <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				%b.imag = shufflevector <8 x float> %b, <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				%0 = fmul fast <4 x float> %b.imag, %a.real
				%1 = fmul fast <4 x float> %b.real, %a.imag
				%2 = fadd fast <4 x float> %1, %0
				%3 = fmul fast <4 x float> %b.real, %a.real
				%4 = fmul fast <4 x float> %a.imag, %b.imag
				%5 = fsub fast <4 x float> %3, %4
				%interleaved.vec = shufflevector <4 x float> %5, <4 x float> %2, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
				ret <8 x float> %interleaved.vec
				}

				define <16 x float> @complex_mul_v16f32(<16 x float> %a, <16 x float> %b) #0 {
				; CHECK-LABEL: complex_mul_v16f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r4, r5, r6, lr}
				; CHECK-NEXT: push {r4, r5, r6, lr}
				; CHECK-NEXT: .vsave {d8, d9, d10, d11, d12, d13, d14, d15}
				; CHECK-NEXT: vpush {d8, d9, d10, d11, d12, d13, d14, d15}
				; CHECK-NEXT: add r4, sp, #184
				; CHECK-NEXT: add r5, sp, #168
				; CHECK-NEXT: add r6, sp, #120
				; CHECK-NEXT: vmov d0, r2, r3
				; CHECK-NEXT: add r2, sp, #104
				; CHECK-NEXT: vldrw.u32 q4, [r6]
				; CHECK-NEXT: vldrw.u32 q6, [r2]
				; CHECK-NEXT: vldrw.u32 q5, [r4]
				; CHECK-NEXT: vldrw.u32 q3, [r5]
				; CHECK-NEXT: vmov.f32 s10, s16
				; CHECK-NEXT: vmov.f32 s8, s24
				; CHECK-NEXT: add.w lr, sp, #136
				; CHECK-NEXT: vmov.f32 s9, s26
				; CHECK-NEXT: add r1, sp, #152
				; CHECK-NEXT: vmov.f32 s28, s13
				; CHECK-NEXT: add.w r12, sp, #88
				; CHECK-NEXT: vmov.f32 s29, s15
				; CHECK-NEXT: vldr d1, [sp, #80]
				; CHECK-NEXT: vmov.f32 s24, s25
				; CHECK-NEXT: vmov.f32 s25, s27
				; CHECK-NEXT: vmov.f32 s30, s21
				; CHECK-NEXT: vmov.f32 s31, s23
				; CHECK-NEXT: vmov.f32 s26, s17
				; CHECK-NEXT: vmov.f32 s27, s19
				; CHECK-NEXT: vmov.f32 s11, s18
				; CHECK-NEXT: vmul.f32 q4, q6, q7
				; CHECK-NEXT: vmov.f32 s13, s14
				; CHECK-NEXT: vmul.f32 q1, q7, q2
				; CHECK-NEXT: vmov.f32 s14, s20
				; CHECK-NEXT: vneg.f32 q4, q4
				; CHECK-NEXT: vmov.f32 s15, s22
				; CHECK-NEXT: vldrw.u32 q5, [r1]
				; CHECK-NEXT: vfma.f32 q4, q3, q2
				; CHECK-NEXT: vfma.f32 q1, q3, q6
				; CHECK-NEXT: vmov.f32 s13, s4
				; CHECK-NEXT: vmov.f32 s15, s5
				; CHECK-NEXT: vmov.f32 s5, s6
				; CHECK-NEXT: vmov.f32 s12, s16
				; CHECK-NEXT: vmov.f32 s14, s17
				; CHECK-NEXT: vmov.f32 s4, s18
				; CHECK-NEXT: vmov.f32 s6, s19
				; CHECK-NEXT: vldrw.u32 q4, [lr]
				; CHECK-NEXT: vmov.f32 s10, s20
				; CHECK-NEXT: vstrw.32 q1, [r0, #48]
				; CHECK-NEXT: vmov.f32 s8, s16
				; CHECK-NEXT: vstrw.32 q3, [r0, #32]
				; CHECK-NEXT: vmov.f32 s9, s18
				; CHECK-NEXT: vmov.f32 s16, s17
				; CHECK-NEXT: vmov.f32 s17, s19
				; CHECK-NEXT: vmov.f32 s11, s22
				; CHECK-NEXT: vmov.f32 s18, s21
				; CHECK-NEXT: vmov.f32 s19, s23
				; CHECK-NEXT: vldrw.u32 q5, [r12]
				; CHECK-NEXT: vmov.f32 s12, s0
				; CHECK-NEXT: vmov.f32 s13, s2
				; CHECK-NEXT: vmov.f32 s14, s20
				; CHECK-NEXT: vmov.f32 s15, s22
				; CHECK-NEXT: vmov.f32 s0, s1
				; CHECK-NEXT: vmul.f32 q1, q4, q3
				; CHECK-NEXT: vmov.f32 s1, s3
				; CHECK-NEXT: vmov.f32 s2, s21
				; CHECK-NEXT: vmov.f32 s3, s23
				; CHECK-NEXT: vfma.f32 q1, q2, q0
				; CHECK-NEXT: vmul.f32 q0, q0, q4
				; CHECK-NEXT: vneg.f32 q0, q0
				; CHECK-NEXT: vfma.f32 q0, q2, q3
				; CHECK-NEXT: vmov.f32 s9, s6
				; CHECK-NEXT: vmov.f32 s8, s2
				; CHECK-NEXT: vmov.f32 s10, s3
				; CHECK-NEXT: vmov.f32 s11, s7
				; CHECK-NEXT: vstrw.32 q2, [r0, #16]
				; CHECK-NEXT: vmov.f32 s8, s0
				; CHECK-NEXT: vmov.f32 s9, s4
				; CHECK-NEXT: vmov.f32 s10, s1
				; CHECK-NEXT: vmov.f32 s11, s5
				; CHECK-NEXT: vstrw.32 q2, [r0]
				; CHECK-NEXT: vpop {d8, d9, d10, d11, d12, d13, d14, d15}
				; CHECK-NEXT: pop {r4, r5, r6, pc}
				entry:
				%a.real = shufflevector <16 x float> %a, <16 x float> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%a.imag = shufflevector <16 x float> %a, <16 x float> poison, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%b.real = shufflevector <16 x float> %b, <16 x float> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%b.imag = shufflevector <16 x float> %b, <16 x float> poison, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%0 = fmul fast <8 x float> %b.imag, %a.real
				%1 = fmul fast <8 x float> %b.real, %a.imag
				%2 = fadd fast <8 x float> %1, %0
				%3 = fmul fast <8 x float> %b.real, %a.real
				%4 = fmul fast <8 x float> %a.imag, %b.imag
				%5 = fsub fast <8 x float> %3, %4
				%interleaved.vec = shufflevector <8 x float> %5, <8 x float> %2, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
				ret <16 x float> %interleaved.vec
				}

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f64-add.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s --mattr=+mve.fp -o - | FileCheck %s

				target triple = "thumbv8.1m.main-none-none-eabi"



				define <2 x double> @complex_add_v2f64(<2 x double> %a, <2 x double> %b) #0 {
				; CHECK-LABEL: complex_add_v2f64:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r4, r5, r6, r7, lr}
				; CHECK-NEXT: push {r4, r5, r6, r7, lr}
				; CHECK-NEXT: .pad #4
				; CHECK-NEXT: sub sp, #4
				; CHECK-NEXT: .vsave {d8, d9}
				; CHECK-NEXT: vpush {d8, d9}
				; CHECK-NEXT: mov r5, r0
				; CHECK-NEXT: add r0, sp, #40
				; CHECK-NEXT: vldrw.u32 q4, [r0]
				; CHECK-NEXT: mov r4, r1
				; CHECK-NEXT: vmov r0, r1, d8
				; CHECK-NEXT: bl __aeabi_dsub
				; CHECK-NEXT: mov r6, r0
				; CHECK-NEXT: mov r7, r1
				; CHECK-NEXT: vmov r0, r1, d9
				; CHECK-NEXT: mov r2, r5
				; CHECK-NEXT: mov r3, r4
				; CHECK-NEXT: bl __aeabi_dadd
				; CHECK-NEXT: mov r2, r0
				; CHECK-NEXT: mov r3, r1
				; CHECK-NEXT: mov r0, r6
				; CHECK-NEXT: mov r1, r7
				; CHECK-NEXT: vpop {d8, d9}
				; CHECK-NEXT: add sp, #4
				; CHECK-NEXT: pop {r4, r5, r6, r7, pc}
				entry:
				%a.real = shufflevector <2 x double> %a, <2 x double> zeroinitializer, <1 x i32> <i32 0>
				%a.imag = shufflevector <2 x double> %a, <2 x double> zeroinitializer, <1 x i32> <i32 1>
				%b.real = shufflevector <2 x double> %b, <2 x double> zeroinitializer, <1 x i32> <i32 0>
				%b.imag = shufflevector <2 x double> %b, <2 x double> zeroinitializer, <1 x i32> <i32 1>
				%0 = fsub fast <1 x double> %b.real, %a.imag
				%1 = fadd fast <1 x double> %b.imag, %a.real
				%interleaved.vec = shufflevector <1 x double> %0, <1 x double> %1, <2 x i32> <i32 0, i32 1>
				ret <2 x double> %interleaved.vec
				}
				define <4 x double> @complex_add_v4f64(<4 x double> %a, <4 x double> %b) #0 {
				; CHECK-LABEL: complex_add_v4f64:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r4, r5, r6, r7, r8, r9, r10, lr}
				; CHECK-NEXT: push.w {r4, r5, r6, r7, r8, r9, r10, lr}
				; CHECK-NEXT: .vsave {d8, d9, d10, d11, d12, d13}
				; CHECK-NEXT: vpush {d8, d9, d10, d11, d12, d13}
				; CHECK-NEXT: mov r10, r0
				; CHECK-NEXT: add r0, sp, #104
				; CHECK-NEXT: vldrw.u32 q4, [r0]
				; CHECK-NEXT: mov r8, r3
				; CHECK-NEXT: mov r9, r2
				; CHECK-NEXT: add r2, sp, #88
				; CHECK-NEXT: vmov r0, r1, d8
				; CHECK-NEXT: add r3, sp, #120
				; CHECK-NEXT: vldrw.u32 q5, [r3]
				; CHECK-NEXT: vldrw.u32 q6, [r2]
				; CHECK-NEXT: ldrd r2, r3, [sp, #80]
				; CHECK-NEXT: vmov r7, r5, d10
				; CHECK-NEXT: vmov r6, r4, d13
				; CHECK-NEXT: bl __aeabi_dsub
				; CHECK-NEXT: vmov d8, r0, r1
				; CHECK-NEXT: mov r0, r7
				; CHECK-NEXT: mov r1, r5
				; CHECK-NEXT: mov r2, r6
				; CHECK-NEXT: mov r3, r4
				; CHECK-NEXT: bl __aeabi_dsub
				; CHECK-NEXT: mov r4, r0
				; CHECK-NEXT: mov r5, r1
				; CHECK-NEXT: vmov r0, r1, d11
				; CHECK-NEXT: vmov r2, r3, d12
				; CHECK-NEXT: bl __aeabi_dadd
				; CHECK-NEXT: vmov r2, r3, d9
				; CHECK-NEXT: vmov d11, r0, r1
				; CHECK-NEXT: vmov d10, r4, r5
				; CHECK-NEXT: mov r0, r2
				; CHECK-NEXT: mov r1, r3
				; CHECK-NEXT: mov r2, r9
				; CHECK-NEXT: mov r3, r8
				; CHECK-NEXT: bl __aeabi_dadd
				; CHECK-NEXT: vmov d9, r0, r1
				; CHECK-NEXT: vstrw.32 q5, [r10, #16]
				; CHECK-NEXT: vstrw.32 q4, [r10]
				; CHECK-NEXT: vpop {d8, d9, d10, d11, d12, d13}
				; CHECK-NEXT: pop.w {r4, r5, r6, r7, r8, r9, r10, pc}
				entry:
				%a.real = shufflevector <4 x double> %a, <4 x double> zeroinitializer, <2 x i32> <i32 0, i32 2>
				%a.imag = shufflevector <4 x double> %a, <4 x double> zeroinitializer, <2 x i32> <i32 1, i32 3>
				%b.real = shufflevector <4 x double> %b, <4 x double> zeroinitializer, <2 x i32> <i32 0, i32 2>
				%b.imag = shufflevector <4 x double> %b, <4 x double> zeroinitializer, <2 x i32> <i32 1, i32 3>
				%0 = fsub fast <2 x double> %b.real, %a.imag
				%1 = fadd fast <2 x double> %b.imag, %a.real
				%interleaved.vec = shufflevector <2 x double> %0, <2 x double> %1, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
				ret <4 x double> %interleaved.vec
				}
				define <8 x double> @complex_add_v8f64(<8 x double> %a, <8 x double> %b) #0 {
				; CHECK-LABEL: complex_add_v8f64:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r4, r5, r6, r7, r8, r9, r10, r11, lr}
				; CHECK-NEXT: push.w {r4, r5, r6, r7, r8, r9, r10, r11, lr}
				; CHECK-NEXT: .pad #4
				; CHECK-NEXT: sub sp, #4
				; CHECK-NEXT: .vsave {d8, d9, d10, d11, d12, d13, d14, d15}
				; CHECK-NEXT: vpush {d8, d9, d10, d11, d12, d13, d14, d15}
				; CHECK-NEXT: .pad #48
				; CHECK-NEXT: sub sp, #48
				; CHECK-NEXT: strd r2, r3, [sp, #40] @ 8-byte Folded Spill
				; CHECK-NEXT: mov r8, r0
				; CHECK-NEXT: add r2, sp, #160
				; CHECK-NEXT: add r0, sp, #224
				; CHECK-NEXT: vldrw.u32 q6, [r0]
				; CHECK-NEXT: vldrw.u32 q0, [r2]
				; CHECK-NEXT: vmov r0, r1, d12
				; CHECK-NEXT: vmov r2, r3, d1
				; CHECK-NEXT: vmov.f32 s24, s0
				; CHECK-NEXT: vmov.f32 s25, s1
				; CHECK-NEXT: bl __aeabi_dsub
				; CHECK-NEXT: strd r1, r0, [sp, #16] @ 8-byte Folded Spill
				; CHECK-NEXT: add r0, sp, #208
				; CHECK-NEXT: vldrw.u32 q7, [r0]
				; CHECK-NEXT: ldrd r2, r3, [sp, #152]
				; CHECK-NEXT: vmov r0, r1, d14
				; CHECK-NEXT: bl __aeabi_dsub
				; CHECK-NEXT: add r5, sp, #256
				; CHECK-NEXT: add r3, sp, #240
				; CHECK-NEXT: vldrw.u32 q0, [r5]
				; CHECK-NEXT: add r2, sp, #176
				; CHECK-NEXT: vldrw.u32 q5, [r3]
				; CHECK-NEXT: vldrw.u32 q4, [r2]
				; CHECK-NEXT: vmov r4, r5, d0
				; CHECK-NEXT: vmov r6, r7, d10
				; CHECK-NEXT: vmov r2, r3, d9
				; CHECK-NEXT: vmov d0, r0, r1
				; CHECK-NEXT: vmov.f32 s18, s2
				; CHECK-NEXT: vstrw.32 q0, [sp, #24] @ 16-byte Spill
				; CHECK-NEXT: vmov.f32 s19, s3
				; CHECK-NEXT: vmov r10, r9, d13
				; CHECK-NEXT: strd r4, r5, [sp] @ 8-byte Folded Spill
				; CHECK-NEXT: vmov r4, r5, d15
				; CHECK-NEXT: mov r0, r6
				; CHECK-NEXT: mov r1, r7
				; CHECK-NEXT: strd r4, r5, [sp, #8] @ 8-byte Folded Spill
				; CHECK-NEXT: bl __aeabi_dsub
				; CHECK-NEXT: mov r6, r0
				; CHECK-NEXT: mov r7, r1
				; CHECK-NEXT: vmov r0, r1, d11
				; CHECK-NEXT: vmov r2, r3, d8
				; CHECK-NEXT: bl __aeabi_dadd
				; CHECK-NEXT: vmov r2, r3, d12
				; CHECK-NEXT: add r5, sp, #192
				; CHECK-NEXT: vldrw.u32 q7, [r5]
				; CHECK-NEXT: vmov d11, r0, r1
				; CHECK-NEXT: mov r0, r10
				; CHECK-NEXT: mov r1, r9
				; CHECK-NEXT: vmov r11, r5, d15
				; CHECK-NEXT: vmov d10, r6, r7
				; CHECK-NEXT: bl __aeabi_dadd
				; CHECK-NEXT: vmov d13, r0, r1
				; CHECK-NEXT: ldrd r1, r0, [sp, #16] @ 8-byte Folded Reload
				; CHECK-NEXT: mov r2, r11
				; CHECK-NEXT: mov r3, r5
				; CHECK-NEXT: vmov d12, r0, r1
				; CHECK-NEXT: ldrd r0, r1, [sp] @ 8-byte Folded Reload
				; CHECK-NEXT: bl __aeabi_dsub
				; CHECK-NEXT: mov r5, r0
				; CHECK-NEXT: mov r6, r1
				; CHECK-NEXT: vmov r0, r1, d9
				; CHECK-NEXT: vmov r2, r3, d14
				; CHECK-NEXT: bl __aeabi_dadd
				; CHECK-NEXT: vmov d9, r0, r1
				; CHECK-NEXT: ldrd r0, r1, [sp, #8] @ 8-byte Folded Reload
				; CHECK-NEXT: ldrd r2, r3, [sp, #40] @ 8-byte Folded Reload
				; CHECK-NEXT: bl __aeabi_dadd
				; CHECK-NEXT: vldrw.u32 q0, [sp, #24] @ 16-byte Reload
				; CHECK-NEXT: vmov d8, r5, r6
				; CHECK-NEXT: vmov d1, r0, r1
				; CHECK-NEXT: vstrw.32 q4, [r8, #48]
				; CHECK-NEXT: vstrw.32 q5, [r8, #32]
				; CHECK-NEXT: vstrw.32 q6, [r8, #16]
				; CHECK-NEXT: vstrw.32 q0, [r8]
				; CHECK-NEXT: add sp, #48
				; CHECK-NEXT: vpop {d8, d9, d10, d11, d12, d13, d14, d15}
				; CHECK-NEXT: add sp, #4
				; CHECK-NEXT: pop.w {r4, r5, r6, r7, r8, r9, r10, r11, pc}
				entry:
				%a.real = shufflevector <8 x double> %a, <8 x double> zeroinitializer, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				%a.imag = shufflevector <8 x double> %a, <8 x double> zeroinitializer, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				%b.real = shufflevector <8 x double> %b, <8 x double> zeroinitializer, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				%b.imag = shufflevector <8 x double> %b, <8 x double> zeroinitializer, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				%0 = fsub fast <4 x double> %b.real, %a.imag
				%1 = fadd fast <4 x double> %b.imag, %a.real
				%interleaved.vec = shufflevector <4 x double> %0, <4 x double> %1, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
				ret <8 x double> %interleaved.vec
				}

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f64-mul.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s --mattr=+mve.fp -o - | FileCheck %s
				dmgreen: It might be a little cleaner to add `+fp64` for all these f64 tests.

				target triple = "thumbv8.1m.main-none-none-eabi"

				define <2 x double> @complex_mul_v2f64(<2 x double> %a, <2 x double> %b) #0 {
				; CHECK-LABEL: complex_mul_v2f64:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r4, r5, r6, r7, r8, r9, r10, r11, lr}
				; CHECK-NEXT: push.w {r4, r5, r6, r7, r8, r9, r10, r11, lr}
				; CHECK-NEXT: .pad #4
				; CHECK-NEXT: sub sp, #4
				; CHECK-NEXT: .vsave {d8, d9}
				; CHECK-NEXT: vpush {d8, d9}
				; CHECK-NEXT: .pad #8
				; CHECK-NEXT: sub sp, #8
				; CHECK-NEXT: mov r5, r0
				; CHECK-NEXT: add r0, sp, #64
				; CHECK-NEXT: vldrw.u32 q4, [r0]
				; CHECK-NEXT: mov r4, r1
				; CHECK-NEXT: mov r10, r3
				; CHECK-NEXT: mov r11, r2
				; CHECK-NEXT: vmov r0, r1, d9
				; CHECK-NEXT: mov r2, r5
				; CHECK-NEXT: mov r3, r4
				; CHECK-NEXT: strd r0, r1, [sp] @ 8-byte Folded Spill
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: vmov r8, r9, d8
				; CHECK-NEXT: mov r6, r0
				; CHECK-NEXT: mov r7, r1
				; CHECK-NEXT: mov r2, r11
				; CHECK-NEXT: mov r3, r10
				; CHECK-NEXT: mov r0, r8
				; CHECK-NEXT: mov r1, r9
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r2, r6
				; CHECK-NEXT: mov r3, r7
				; CHECK-NEXT: bl __aeabi_dadd
				; CHECK-NEXT: mov r6, r0
				; CHECK-NEXT: mov r7, r1
				; CHECK-NEXT: mov r0, r8
				; CHECK-NEXT: mov r1, r9
				; CHECK-NEXT: mov r2, r5
				; CHECK-NEXT: mov r3, r4
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: ldrd r2, r3, [sp] @ 8-byte Folded Reload
				; CHECK-NEXT: mov r4, r0
				; CHECK-NEXT: mov r5, r1
				; CHECK-NEXT: mov r0, r11
				; CHECK-NEXT: mov r1, r10
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r2, r0
				; CHECK-NEXT: mov r3, r1
				; CHECK-NEXT: mov r0, r4
				; CHECK-NEXT: mov r1, r5
				; CHECK-NEXT: bl __aeabi_dsub
				; CHECK-NEXT: mov r2, r6
				; CHECK-NEXT: mov r3, r7
				; CHECK-NEXT: add sp, #8
				; CHECK-NEXT: vpop {d8, d9}
				; CHECK-NEXT: add sp, #4
				; CHECK-NEXT: pop.w {r4, r5, r6, r7, r8, r9, r10, r11, pc}
				entry:
				%a.real = shufflevector <2 x double> %a, <2 x double> poison, <1 x i32> <i32 0>
				%a.imag = shufflevector <2 x double> %a, <2 x double> poison, <1 x i32> <i32 1>
				%b.real = shufflevector <2 x double> %b, <2 x double> poison, <1 x i32> <i32 0>
				%b.imag = shufflevector <2 x double> %b, <2 x double> poison, <1 x i32> <i32 1>
				%0 = fmul fast <1 x double> %b.imag, %a.real
				%1 = fmul fast <1 x double> %b.real, %a.imag
				%2 = fadd fast <1 x double> %1, %0
				%3 = fmul fast <1 x double> %b.real, %a.real
				%4 = fmul fast <1 x double> %a.imag, %b.imag
				%5 = fsub fast <1 x double> %3, %4
				%interleaved.vec = shufflevector <1 x double> %5, <1 x double> %2, <2 x i32> <i32 0, i32 1>
				ret <2 x double> %interleaved.vec
				}

				define <4 x double> @complex_mul_v4f64(<4 x double> %a, <4 x double> %b) #0 {
				; CHECK-LABEL: complex_mul_v4f64:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r4, r5, r6, r7, r8, r9, r10, r11, lr}
				; CHECK-NEXT: push.w {r4, r5, r6, r7, r8, r9, r10, r11, lr}
				; CHECK-NEXT: .pad #4
				; CHECK-NEXT: sub sp, #4
				; CHECK-NEXT: .vsave {d8, d9, d10, d11}
				; CHECK-NEXT: vpush {d8, d9, d10, d11}
				; CHECK-NEXT: .pad #56
				; CHECK-NEXT: sub sp, #56
				; CHECK-NEXT: str r0, [sp, #52] @ 4-byte Spill
				; CHECK-NEXT: add r0, sp, #152
				; CHECK-NEXT: vldrw.u32 q4, [r0]
				; CHECK-NEXT: strd r2, r3, [sp, #28] @ 8-byte Folded Spill
				; CHECK-NEXT: vmov r0, r1, d9
				; CHECK-NEXT: strd r0, r1, [sp, #44] @ 8-byte Folded Spill
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r4, r0
				; CHECK-NEXT: mov r5, r1
				; CHECK-NEXT: vmov r0, r1, d8
				; CHECK-NEXT: ldrd r2, r3, [sp, #128]
				; CHECK-NEXT: strd r0, r1, [sp, #20] @ 8-byte Folded Spill
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r2, r4
				; CHECK-NEXT: mov r3, r5
				; CHECK-NEXT: bl __aeabi_dadd
				; CHECK-NEXT: add r2, sp, #136
				; CHECK-NEXT: strd r1, r0, [sp, #36] @ 8-byte Folded Spill
				; CHECK-NEXT: add r1, sp, #168
				; CHECK-NEXT: vldrw.u32 q5, [r2]
				; CHECK-NEXT: vldrw.u32 q4, [r1]
				; CHECK-NEXT: vmov r9, r6, d10
				; CHECK-NEXT: vmov r0, r1, d9
				; CHECK-NEXT: mov r2, r9
				; CHECK-NEXT: mov r3, r6
				; CHECK-NEXT: strd r0, r1, [sp, #12] @ 8-byte Folded Spill
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: vmov r7, r5, d11
				; CHECK-NEXT: mov r4, r0
				; CHECK-NEXT: vmov r8, r10, d8
				; CHECK-NEXT: mov r11, r1
				; CHECK-NEXT: mov r2, r7
				; CHECK-NEXT: mov r3, r5
				; CHECK-NEXT: mov r0, r8
				; CHECK-NEXT: mov r1, r10
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r2, r4
				; CHECK-NEXT: mov r3, r11
				; CHECK-NEXT: bl __aeabi_dadd
				; CHECK-NEXT: add r3, sp, #20
				; CHECK-NEXT: strd r1, r0, [sp, #4] @ 8-byte Folded Spill
				; CHECK-NEXT: ldm r3, {r0, r1, r2, r3} @ 16-byte Folded Reload
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r11, r0
				; CHECK-NEXT: str r1, [sp, #32] @ 4-byte Spill
				; CHECK-NEXT: mov r0, r8
				; CHECK-NEXT: mov r1, r10
				; CHECK-NEXT: mov r2, r9
				; CHECK-NEXT: mov r3, r6
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: ldrd r2, r3, [sp, #12] @ 8-byte Folded Reload
				; CHECK-NEXT: mov r9, r0
				; CHECK-NEXT: mov r8, r1
				; CHECK-NEXT: mov r0, r7
				; CHECK-NEXT: mov r1, r5
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r5, r0
				; CHECK-NEXT: mov r7, r1
				; CHECK-NEXT: ldrd r0, r1, [sp, #128]
				; CHECK-NEXT: ldrd r2, r3, [sp, #44] @ 8-byte Folded Reload
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r4, r0
				; CHECK-NEXT: mov r6, r1
				; CHECK-NEXT: ldrd r1, r0, [sp, #4] @ 8-byte Folded Reload
				; CHECK-NEXT: mov r2, r5
				; CHECK-NEXT: mov r3, r7
				; CHECK-NEXT: vmov d9, r0, r1
				; CHECK-NEXT: mov r0, r9
				; CHECK-NEXT: mov r1, r8
				; CHECK-NEXT: bl __aeabi_dsub
				; CHECK-NEXT: vmov d8, r0, r1
				; CHECK-NEXT: ldrd r1, r0, [sp, #36] @ 8-byte Folded Reload
				; CHECK-NEXT: mov r2, r4
				; CHECK-NEXT: mov r3, r6
				; CHECK-NEXT: vmov d11, r0, r1
				; CHECK-NEXT: ldr r1, [sp, #32] @ 4-byte Reload
				; CHECK-NEXT: mov r0, r11
				; CHECK-NEXT: bl __aeabi_dsub
				; CHECK-NEXT: vmov d10, r0, r1
				; CHECK-NEXT: ldr r0, [sp, #52] @ 4-byte Reload
				; CHECK-NEXT: vstrw.32 q4, [r0, #16]
				; CHECK-NEXT: vstrw.32 q5, [r0]
				; CHECK-NEXT: add sp, #56
				; CHECK-NEXT: vpop {d8, d9, d10, d11}
				; CHECK-NEXT: add sp, #4
				; CHECK-NEXT: pop.w {r4, r5, r6, r7, r8, r9, r10, r11, pc}
				entry:
				%a.real = shufflevector <4 x double> %a, <4 x double> poison, <2 x i32> <i32 0, i32 2>
				%a.imag = shufflevector <4 x double> %a, <4 x double> poison, <2 x i32> <i32 1, i32 3>
				%b.real = shufflevector <4 x double> %b, <4 x double> poison, <2 x i32> <i32 0, i32 2>
				%b.imag = shufflevector <4 x double> %b, <4 x double> poison, <2 x i32> <i32 1, i32 3>
				%0 = fmul fast <2 x double> %b.imag, %a.real
				%1 = fmul fast <2 x double> %b.real, %a.imag
				%2 = fadd fast <2 x double> %1, %0
				%3 = fmul fast <2 x double> %b.real, %a.real
				%4 = fmul fast <2 x double> %a.imag, %b.imag
				%5 = fsub fast <2 x double> %3, %4
				%interleaved.vec = shufflevector <2 x double> %5, <2 x double> %2, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
				ret <4 x double> %interleaved.vec
				}

				define <8 x double> @complex_mul_v8f64(<8 x double> %a, <8 x double> %b) #0 {
				; CHECK-LABEL: complex_mul_v8f64:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r4, r5, r6, r7, r8, r9, r10, r11, lr}
				; CHECK-NEXT: push.w {r4, r5, r6, r7, r8, r9, r10, r11, lr}
				; CHECK-NEXT: .pad #4
				; CHECK-NEXT: sub sp, #4
				; CHECK-NEXT: .vsave {d8, d9, d10, d11, d12, d13, d14, d15}
				; CHECK-NEXT: vpush {d8, d9, d10, d11, d12, d13, d14, d15}
				; CHECK-NEXT: .pad #128
				; CHECK-NEXT: sub sp, #128
				; CHECK-NEXT: add r1, sp, #320
				; CHECK-NEXT: mov r9, r2
				; CHECK-NEXT: str r2, [sp, #48] @ 4-byte Spill
				; CHECK-NEXT: add r2, sp, #256
				; CHECK-NEXT: vldrw.u32 q4, [r1]
				; CHECK-NEXT: vldrw.u32 q5, [r2]
				; CHECK-NEXT: str r0, [sp, #124] @ 4-byte Spill
				; CHECK-NEXT: mov r8, r3
				; CHECK-NEXT: vmov r0, r10, d9
				; CHECK-NEXT: str r3, [sp, #52] @ 4-byte Spill
				; CHECK-NEXT: vmov r2, r3, d10
				; CHECK-NEXT: str r0, [sp, #120] @ 4-byte Spill
				; CHECK-NEXT: mov r1, r10
				; CHECK-NEXT: strd r2, r3, [sp, #112] @ 8-byte Folded Spill
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: vmov r2, r3, d11
				; CHECK-NEXT: mov r4, r0
				; CHECK-NEXT: mov r5, r1
				; CHECK-NEXT: vmov r0, r1, d8
				; CHECK-NEXT: strd r2, r3, [sp, #96] @ 8-byte Folded Spill
				; CHECK-NEXT: strd r0, r1, [sp, #72] @ 8-byte Folded Spill
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r2, r4
				; CHECK-NEXT: mov r3, r5
				; CHECK-NEXT: bl __aeabi_dadd
				; CHECK-NEXT: mov r5, r1
				; CHECK-NEXT: add r2, sp, #240
				; CHECK-NEXT: add r1, sp, #304
				; CHECK-NEXT: vldrw.u32 q5, [r2]
				; CHECK-NEXT: vldrw.u32 q4, [r1]
				; CHECK-NEXT: mov r4, r0
				; CHECK-NEXT: vmov r2, r3, d10
				; CHECK-NEXT: vmov r0, r1, d9
				; CHECK-NEXT: strd r2, r3, [sp, #80] @ 8-byte Folded Spill
				; CHECK-NEXT: strd r0, r1, [sp, #104] @ 8-byte Folded Spill
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: vmov r2, r3, d11
				; CHECK-NEXT: mov r6, r0
				; CHECK-NEXT: mov r7, r1
				; CHECK-NEXT: vmov r0, r1, d8
				; CHECK-NEXT: strd r2, r3, [sp, #64] @ 8-byte Folded Spill
				; CHECK-NEXT: strd r0, r1, [sp, #40] @ 8-byte Folded Spill
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r2, r6
				; CHECK-NEXT: mov r3, r7
				; CHECK-NEXT: vmov d9, r4, r5
				; CHECK-NEXT: bl __aeabi_dadd
				; CHECK-NEXT: mov r4, r0
				; CHECK-NEXT: add r0, sp, #288
				; CHECK-NEXT: vldrw.u32 q5, [r0]
				; CHECK-NEXT: mov r5, r1
				; CHECK-NEXT: mov r2, r9
				; CHECK-NEXT: mov r3, r8
				; CHECK-NEXT: vmov r0, r1, d11
				; CHECK-NEXT: strd r0, r1, [sp, #88] @ 8-byte Folded Spill
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r6, r0
				; CHECK-NEXT: mov r7, r1
				; CHECK-NEXT: vmov r0, r1, d10
				; CHECK-NEXT: ldrd r2, r3, [sp, #232]
				; CHECK-NEXT: strd r0, r1, [sp, #16] @ 8-byte Folded Spill
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r2, r6
				; CHECK-NEXT: mov r3, r7
				; CHECK-NEXT: vmov d11, r4, r5
				; CHECK-NEXT: bl __aeabi_dadd
				; CHECK-NEXT: strd r1, r0, [sp, #32] @ 8-byte Folded Spill
				; CHECK-NEXT: add r2, sp, #272
				; CHECK-NEXT: add r1, sp, #336
				; CHECK-NEXT: vldrw.u32 q7, [r2]
				; CHECK-NEXT: vldrw.u32 q6, [r1]
				; CHECK-NEXT: vmov r2, r3, d14
				; CHECK-NEXT: vmov r0, r1, d13
				; CHECK-NEXT: strd r2, r3, [sp, #24] @ 8-byte Folded Spill
				; CHECK-NEXT: strd r0, r1, [sp, #56] @ 8-byte Folded Spill
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: vmov r8, r11, d12
				; CHECK-NEXT: mov r4, r0
				; CHECK-NEXT: vmov r2, r3, d15
				; CHECK-NEXT: mov r7, r1
				; CHECK-NEXT: mov r0, r8
				; CHECK-NEXT: mov r1, r11
				; CHECK-NEXT: strd r2, r3, [sp, #8] @ 8-byte Folded Spill
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r2, r4
				; CHECK-NEXT: mov r3, r7
				; CHECK-NEXT: bl __aeabi_dadd
				; CHECK-NEXT: strd r1, r0, [sp] @ 8-byte Folded Spill
				; CHECK-NEXT: ldrd r0, r1, [sp, #72] @ 8-byte Folded Reload
				; CHECK-NEXT: ldrd r2, r3, [sp, #112] @ 8-byte Folded Reload
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r4, r0
				; CHECK-NEXT: mov r7, r1
				; CHECK-NEXT: ldrd r0, r1, [sp, #96] @ 8-byte Folded Reload
				; CHECK-NEXT: mov r3, r10
				; CHECK-NEXT: ldr r2, [sp, #120] @ 4-byte Reload
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r2, r0
				; CHECK-NEXT: mov r3, r1
				; CHECK-NEXT: mov r0, r4
				; CHECK-NEXT: mov r1, r7
				; CHECK-NEXT: bl __aeabi_dsub
				; CHECK-NEXT: mov r4, r0
				; CHECK-NEXT: mov r7, r1
				; CHECK-NEXT: ldrd r0, r1, [sp, #40] @ 8-byte Folded Reload
				; CHECK-NEXT: ldrd r2, r3, [sp, #80] @ 8-byte Folded Reload
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r9, r0
				; CHECK-NEXT: mov r10, r1
				; CHECK-NEXT: ldrd r0, r1, [sp, #64] @ 8-byte Folded Reload
				; CHECK-NEXT: ldrd r2, r3, [sp, #104] @ 8-byte Folded Reload
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r2, r0
				; CHECK-NEXT: mov r3, r1
				; CHECK-NEXT: mov r0, r9
				; CHECK-NEXT: mov r1, r10
				; CHECK-NEXT: vmov d8, r4, r7
				; CHECK-NEXT: bl __aeabi_dsub
				; CHECK-NEXT: mov r4, r0
				; CHECK-NEXT: mov r7, r1
				; CHECK-NEXT: ldrd r0, r1, [sp, #16] @ 8-byte Folded Reload
				; CHECK-NEXT: ldrd r2, r3, [sp, #48] @ 8-byte Folded Reload
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r6, r0
				; CHECK-NEXT: mov r5, r1
				; CHECK-NEXT: ldrd r0, r1, [sp, #232]
				; CHECK-NEXT: ldrd r2, r3, [sp, #88] @ 8-byte Folded Reload
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r2, r0
				; CHECK-NEXT: mov r3, r1
				; CHECK-NEXT: ldrd r1, r0, [sp] @ 8-byte Folded Reload
				; CHECK-NEXT: vmov d10, r4, r7
				; CHECK-NEXT: vmov d15, r0, r1
				; CHECK-NEXT: ldrd r1, r0, [sp, #32] @ 8-byte Folded Reload
				; CHECK-NEXT: vmov d13, r0, r1
				; CHECK-NEXT: mov r0, r6
				; CHECK-NEXT: mov r1, r5
				; CHECK-NEXT: bl __aeabi_dsub
				; CHECK-NEXT: ldrd r2, r3, [sp, #24] @ 8-byte Folded Reload
				; CHECK-NEXT: mov r7, r0
				; CHECK-NEXT: mov r4, r1
				; CHECK-NEXT: mov r0, r8
				; CHECK-NEXT: mov r1, r11
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r5, r0
				; CHECK-NEXT: mov r6, r1
				; CHECK-NEXT: ldrd r0, r1, [sp, #8] @ 8-byte Folded Reload
				; CHECK-NEXT: ldrd r2, r3, [sp, #56] @ 8-byte Folded Reload
				; CHECK-NEXT: bl __aeabi_dmul
				; CHECK-NEXT: mov r2, r0
				; CHECK-NEXT: mov r3, r1
				; CHECK-NEXT: mov r0, r5
				; CHECK-NEXT: mov r1, r6
				; CHECK-NEXT: bl __aeabi_dsub
				; CHECK-NEXT: vmov d14, r0, r1
				; CHECK-NEXT: ldr r0, [sp, #124] @ 4-byte Reload
				; CHECK-NEXT: vmov d12, r7, r4
				; CHECK-NEXT: vstrw.32 q7, [r0, #48]
				; CHECK-NEXT: vstrw.32 q4, [r0, #32]
				; CHECK-NEXT: vstrw.32 q5, [r0, #16]
				; CHECK-NEXT: vstrw.32 q6, [r0]
				; CHECK-NEXT: add sp, #128
				; CHECK-NEXT: vpop {d8, d9, d10, d11, d12, d13, d14, d15}
				; CHECK-NEXT: add sp, #4
				; CHECK-NEXT: pop.w {r4, r5, r6, r7, r8, r9, r10, r11, pc}
				entry:
				%a.real = shufflevector <8 x double> %a, <8 x double> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				%a.imag = shufflevector <8 x double> %a, <8 x double> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				%b.real = shufflevector <8 x double> %b, <8 x double> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				%b.imag = shufflevector <8 x double> %b, <8 x double> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				%0 = fmul fast <4 x double> %b.imag, %a.real
				%1 = fmul fast <4 x double> %b.real, %a.imag
				%2 = fadd fast <4 x double> %1, %0
				%3 = fmul fast <4 x double> %b.real, %a.real
				%4 = fmul fast <4 x double> %a.imag, %b.imag
				%5 = fsub fast <4 x double> %3, %4
				%interleaved.vec = shufflevector <4 x double> %5, <4 x double> %2, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
				ret <8 x double> %interleaved.vec
				}
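
For readers following the shuffle masks in the two functions above, here is a minimal scalar sketch of what the IR computes: even lanes hold real parts, odd lanes hold imaginary parts, and the final shufflevector re-interleaves the two results. The helper name `complexMulInterleaved` is hypothetical and is not part of this patch; it only mirrors the arithmetic of the test, not the pass itself.

```cpp
#include <array>
#include <cstddef>

// Hypothetical reference helper (an assumption for illustration, not code
// from the patch). Multiplies N/2 complex numbers stored interleaved as
// {re0, im0, re1, im1, ...}, following the same steps as the test IR.
template <std::size_t N>
std::array<double, N> complexMulInterleaved(const std::array<double, N> &a,
                                            const std::array<double, N> &b) {
  static_assert(N % 2 == 0, "interleaved vector needs an even lane count");
  std::array<double, N> out{};
  for (std::size_t i = 0; i < N / 2; ++i) {
    // Deinterleave: masks <0, 2, ...> and <1, 3, ...> in the IR.
    const double aRe = a[2 * i], aIm = a[2 * i + 1];
    const double bRe = b[2 * i], bIm = b[2 * i + 1];
    // %5 = fsub(%3, %4): real part.
    const double re = bRe * aRe - aIm * bIm;
    // %2 = fadd(%1, %0): imaginary part.
    const double im = bRe * aIm + bIm * aRe;
    // Interleave: mask <0, N/2, 1, N/2 + 1, ...> in the IR.
    out[2 * i] = re;
    out[2 * i + 1] = im;
  }
  return out;
}
```

With `N = 4` this corresponds to `complex_mul_v4f64`, and with `N = 8` to `complex_mul_v8f64`. The sketch ignores the `fast` flags, so it makes no claim about how the real lowering may reassociate the operations.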

llvm/test/CodeGen/ARM/O3-pipeline.ll

	; RUN: llc -mtriple=arm -O3 -debug-pass=Structure < %s -o /dev/null 2>&1 | grep -v "Verify generated machine code" | FileCheck %s			; RUN: llc -mtriple=arm -O3 -debug-pass=Structure < %s -o /dev/null 2>&1 | grep -v "Verify generated machine code" | FileCheck %s

				dmgreen: This looks like it was incorrectly added for this test.
	; REQUIRES: asserts			; REQUIRES: asserts

	; CHECK: ModulePass Manager			; CHECK: ModulePass Manager
	; CHECK-NEXT: Pre-ISel Intrinsic Lowering			; CHECK-NEXT: Pre-ISel Intrinsic Lowering
	; CHECK-NEXT: FunctionPass Manager			; CHECK-NEXT: FunctionPass Manager
	; CHECK-NEXT: Expand Atomic instructions			; CHECK-NEXT: Expand Atomic instructions
	; CHECK-NEXT: Simplify the CFG			; CHECK-NEXT: Simplify the CFG
	; CHECK-NEXT: Dominator Tree Construction			; CHECK-NEXT: Dominator Tree Construction
	... 30 lines not shown ...
	; CHECK-NEXT: Scalarize Masked Memory Intrinsics			; CHECK-NEXT: Scalarize Masked Memory Intrinsics
	; CHECK-NEXT: Expand reduction intrinsics			; CHECK-NEXT: Expand reduction intrinsics
	; CHECK-NEXT: Natural Loop Information			; CHECK-NEXT: Natural Loop Information
	; CHECK-NEXT: TLS Variable Hoist			; CHECK-NEXT: TLS Variable Hoist
	; CHECK-NEXT: Scalar Evolution Analysis			; CHECK-NEXT: Scalar Evolution Analysis
	; CHECK-NEXT: Basic Alias Analysis (stateless AA impl)			; CHECK-NEXT: Basic Alias Analysis (stateless AA impl)
	; CHECK-NEXT: Function Alias Analysis Results			; CHECK-NEXT: Function Alias Analysis Results
	; CHECK-NEXT: Transform functions to use DSP intrinsics			; CHECK-NEXT: Transform functions to use DSP intrinsics
				; CHECK-NEXT: Complex Arithmetic Pass
	; CHECK-NEXT: Interleaved Access Pass			; CHECK-NEXT: Interleaved Access Pass
	; CHECK-NEXT: Type Promotion			; CHECK-NEXT: Type Promotion
	; CHECK-NEXT: CodeGen Prepare			; CHECK-NEXT: CodeGen Prepare
	; CHECK-NEXT: Dominator Tree Construction			; CHECK-NEXT: Dominator Tree Construction
	; CHECK-NEXT: Exception handling preparation			; CHECK-NEXT: Exception handling preparation
	; CHECK-NEXT: Merge internal globals			; CHECK-NEXT: Merge internal globals
	; CHECK-NEXT: Natural Loop Information			; CHECK-NEXT: Natural Loop Information
	; CHECK-NEXT: Scalar Evolution Analysis			; CHECK-NEXT: Scalar Evolution Analysis
	... 148 lines not shown ...

Revision Contents

Diff 438664

llvm/include/llvm/CodeGen/ComplexDeinterleavingPass.h

llvm/include/llvm/CodeGen/Passes.h

llvm/include/llvm/CodeGen/TargetLowering.h

llvm/include/llvm/InitializePasses.h

llvm/lib/CodeGen/CMakeLists.txt

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp

llvm/lib/Target/ARM/ARMISelLowering.h

llvm/lib/Target/ARM/ARMISelLowering.cpp

llvm/lib/Target/ARM/ARMTargetMachine.cpp

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-add.ll

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-mul.ll

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-mul.ll

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f64-add.ll

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f64-mul.ll

llvm/test/CodeGen/ARM/O3-pipeline.ll
