
[ARM][CodeGen] Add support for complex deinterleaving
Needs Review · Public

Authored by NickGuy on Nov 18 2021, 10:29 AM.

Details

Summary

Adds the Complex Deinterleaving Pass implementing support for complex numbers in a target-independent manner, deferring to the TargetLowering for the given target to create a target-specific intrinsic.

Diff Detail

Event Timeline


...
I'm not exactly sure why to do target specific pattern matching here. We could simply add generic complex intrinsics and the different patterns could be matched at each archs ISEL, no? I do agree that it should check in the backend if it should generate a complex operation of a given vector type.
...
Is there any strong reason to create target specific intrinsics instead of having generic intrinsics with polymorphism? Such as llvm.complex.add.v2f32?

I wanted to avoid adding target-independent intrinsics, as doing so might require substantial upstream consensus (Though this is being worked on in D119287). My thinking was that we can work on targeting our own architectures first, and then match incoming target-independent intrinsics after they are implemented. Additionally, IR intrinsics for Arm and AArch64 are already implemented, further reducing the amount of work required to enable initial support.

These difference can't be detected at isel?

Unless I'm missing something, there aren't any differences to be detected by isel. The distinction only applies when generating the relevant IR intrinsic, whereas isel is responsible for substituting the IR intrinsic with the instruction/s (in my understanding).

In a more global view, how do you plan to generate such input patterns detected here?

The patterns I've been focusing on are those emitted after loop vectorisation (which incidentally are also those emitted using std::complex with -ffast-math), because the MVE and Neon complex number instructions are vector instructions. Scalar patterns (like those in your linked snippet) are planned, but will require a bit more work to implement properly.

fhahn added a comment. Feb 9 2022, 9:00 AM

I think @jcranmer-intel is working on getting Clang to emit target-independent intrinsics for complex operations (see D119284 & linked patches). It might be good to sync up.

As Florian mentioned, I just re-uploaded a full stack of patches for complex intrinsics support, ranging from defining multiply and divide intrinsics, including an expansion for x86 architecture in both expansion to __mulsc3 and friends and full lowering to instructions, as well as building on top of them to finally get CX_LIMITED_RANGE support into clang. The most interesting patch is probably D119287, since that's the one that does all of the codegen work that this is largely doing, and I personally don't have sufficient expertise with ARM or AArch64 to design that code very well.

I haven't delved into the ARM-specific code in detail, but the ComplexArithmeticGraph feels like it's reinventing a lot of Instruction-like infrastructure just to avoid having to do anything with complex intrinsics. I may be biased here in thinking that it would be better to move to standardized complex intrinsics, and I understand why you don't want to go there, but there are two questions I have:

  • What operations would you need from standardized complex intrinsics to completely eliminate all of the logic in ComplexArithmeticGraph?
  • Have you tried an interface that lets targets nominate target-specific intrinsics for complex operations in lieu of creating a new graph ISA?
llvm/include/llvm/Transforms/Scalar/ComplexArithmetic.h
34

This says ComplexArithmetic, but it's mostly just limited to complex multiplication, right? There's no support for complex division or absolute value that I see (not that complex division is implemented in any hardware I'm aware of).

llvm/lib/Transforms/Scalar/ComplexArithmetic.cpp
447

In my experience with complex instructions, one of the big issues is that the varied frontends end up creating a variety of patterns for complex arithmetic, so you end up with a random choice of a complex number being represented in LLVM IR as a struct, a vector, or pairs of scalar numbers (especially as the ABIs for passing or returning them through functions is so insane). That's leaving aside the question of how the complex multiply itself is represented (is it expanded in IR, or a call to __mulsc3, or both?).

By starting the search for a complex multiply at a shufflevector, you're really leaving a lot of opportunities to match complex multiplies off the table. The pattern-matching I did in D119288 looks for insertvalue, insertelement, and matching stores for things that might be the result of complex multiplies or divisions.

(This may also be because you're aiming to do this after vectorization, and one of the problems that I'm trying to solve with complex intrinsics is that the patterns of complex instructions generated by the frontend aren't always kosher for vectorization).

...
I'm not exactly sure why to do target specific pattern matching here. We could simply add generic complex intrinsics and the different patterns could be matched at each archs ISEL, no? I do agree that it should check in the backend if it should generate a complex operation of a given vector type.
...
Is there any strong reason to create target specific intrinsics instead of having generic intrinsics with polymorphism? Such as llvm.complex.add.v2f32?

I wanted to avoid adding target-independent intrinsics, as doing so might require substantial upstream consensus (Though this is being worked on in D119287). My thinking was that we can work on targeting our own architectures first, and then match incoming target-independent intrinsics after they are implemented. Additionally, IR intrinsics for Arm and AArch64 are already implemented, further reducing the amount of work required to enable initial support.

My understanding was that this had very little to do with upstream consensus and more to do with generating optimal code. I probably didn't say this clearly enough but as far as I see it this pass shouldn't be matching "complex multiply". It should be matching "partial multiply with rotate".

AArch64 (under certain architecture options) has an instruction that is called fcmla. (Arm MVE has similar instructions, which this pass is targeting at the moment). It takes a vector and a "rotate" that can be 0, 90, 180 or 270. It performs something like (d0=d0+s0*t0, d1=d1+s0*t1) for the odd/even lanes with a rotate of #0. Or (d0=d0-s1*t1, d1=d1+s1*t0) with a rotate of #90. You can combine two fcmla with rotations of 0 and 90 to produce a single complex multiply, but they can also be combined in all kinds of other weird and wonderful ways. There is no single "complex multiply" instruction for AArch64/MVE, and if you limit the optimizer to just acting on complex multiply, you are always going to be leaving performance on the table from all the other patterns that could be selected.

These difference can't be detected at isel?

Unless I'm missing something, there aren't any differences to be detected by isel. The distinction only applies when generating the relevant IR intrinsic, whereas isel is responsible for substituting the IR intrinsic with the instruction/s (in my understanding).

This was written pre-ISel so that it could be shared between AArch64 and Arm MVE. It also makes looking at chunks of related instructions at the same time easier, in the same way as the MVE lane interleaving pass does (you look up to sources and down to sinks and whatnot).

In a more global view, how do you plan to generate such input patterns detected here?

The patterns I've been focusing on are those emitted after loop vectorisation (which incidentally are also those emitted using std::complex with -ffast-math), because the MVE and Neon complex number instructions are vector instructions. Scalar patterns (like those in your linked snippet) are planned, but will require a bit more work to implement properly.

I think @jcranmer-intel is working on getting Clang to emit target-independent intrinsics for complex operations (see D119284 & linked patches). It might be good to sync up.

As Florian mentioned, I just re-uploaded a full stack of patches for complex intrinsics support, ranging from defining multiply and divide intrinsics, including an expansion for x86 architecture in both expansion to __mulsc3 and friends and full lowering to instructions, as well as building on top of them to finally get CX_LIMITED_RANGE support into clang. The most interesting patch is probably D119287, since that's the one that does all of the codegen work that this is largely doing, and I personally don't have sufficient expertise with ARM or AArch64 to design that code very well.

Yeah thanks, we saw the updates. I had already left a message on https://reviews.llvm.org/D114398 a long time ago, but I think it was missed because the patch was a draft, and apparently that makes it fairly hidden in a Phabricator review.

I still have misgivings about generic complex intrinsics being the correct representation for the mid-end of LLVM. As far as I understand, at the moment they would just block fmul/fadd/etc. folds that might already be happening, and block any vectorization (which, from what I understand of the problem domain of complex numbers, is really important!)

As I said above, it feels difficult for me to see how they will produce the most optimal solution, given the instructions that are really available. And I don't yet have any examples of anything they really make better/easier to optimize.

Happy to hear your thoughts on that; my experience with complex numbers mainly comes from people using arrays of interleaved real/imaginary numbers, not _Complex or std::complex. From my understanding of what Nick had tried, they would all vectorize so that this pass (perhaps opportunistically) could match to something better for the target. From looking at the patches for the complex intrinsics, they look much more focussed on the clang representation and whether they will overflow.

Thanks all for the comments so far (And thanks Dave for taking on the evening shift, as it were)

I haven't delved into the ARM-specific code in detail, but the ComplexArithmeticGraph feels like it's reinventing a lot of Instruction-like infrastructure just to avoid having to do anything with complex intrinsics.

In hindsight, calling it a graph may be a misnomer; it acts more as a registry for attaching metadata to instructions, sitting alongside the Instruction use-def graph.

What operations would you need from standardized complex intrinsics to completely eliminate all of logic in ComplexArithmeticGraph?

Having operations that can map to the Arm/AArch64 instructions would be a requirement for us to solely depend on them. Beyond those defined in D119284: addition with rotations, and multiplying both complex components by a real/imaginary number.
That said, having the frontend emit numerous intrinsics just for Arm/AArch64 might set a bad precedent. I can imagine a case where different architectures have different ideas of what a complex multiply looks like, and each wants its own idea represented by the standard intrinsics. It might be a better idea to have the frontend emit intrinsics for the concept of a complex multiply, while targets can then match the specific patterns they're interested in beyond those.
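"Addition with rotations" can be written out explicitly; this is a sketch of the usual definition of a 90/270-degree rotated complex add (as in MVE's vcadd), not notation taken from the patch itself:

```latex
% Rotated complex addition: the second operand is rotated by 90 or 270
% degrees in the complex plane before the add.
\begin{align*}
a \oplus_{90} b  &= a + ib = (a_r - b_i) + (a_i + b_r)\,i \\
a \oplus_{270} b &= a - ib = (a_r + b_i) + (a_i - b_r)\,i
\end{align*}
```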

Have you tried an interface that lets targets nominate target-specific intrinsics for complex operations in lieu of creating a new graph ISA?

I might be missing something, but I'm not sure I follow how that would be different from delegating down to the TTI to generate the intrinsic. As I alluded to in a previous comment, the major problem with having common intrinsic building is that the parameters differ across architectures.

This says ComplexArithmetic, but it's mostly just limited to complex multiplication, right? There's no support for complex division or absolute value that I see (not that complex division is implemented in any hardware I'm aware of).

Multiplication and addition in its current state. It was designed through an Arm-tinted lens, so only supports the operations we have instructions for.

By starting the search for a complex multiply at a shufflevector, you're really leaving a lot of opportunities to match complex multiplies off the table. The pattern-matching I did in D119288 looks for insertvalue, insertelement, and matching stores for things that might be the result of complex multiplies or divisions.

I've seen insertelement emitted in the scalar versions, which is on my list. But we've focused on matching the vector forms first, as those are where we saw the best potential gains in our preliminary investigations.

...my experience with complex numbers mainly comes from people using arrays of interleaved real/imaginary numbers, not _Complex or std::complex. From my understanding of what Nick had tried, they would all vectorize so that this pass (perhaps opportunistically) could match to something better for the target...

That's correct, both cases (arrays of interleaved, and std::complex) are vectorised the same way, and produce the same IR. Though using std::complex does need -ffast-math to prevent generation of __mulsc3 and co.

NickGuy updated this revision to Diff 416915. Mar 21 2022, 6:54 AM
NickGuy edited the summary of this revision. (Show Details)

It's been longer than I would've liked before I've gotten back to this, but I have:

  • Added support for matching partial complex multiplication patterns, where a complex number is multiplied by either the real or imaginary component.
  • Moved the lowering process from TTI to TLI.
  • Minimised the amount of data passed to the TLI, to only expose what is needed (and not implementation details).

I had also experimented with the idea of taking a case where complex numbers are multiplied by real numbers, but the performance overhead of expanding the real number vector to match the expected pattern of our complex instructions (<a, _, b, _, c, _, ...>) caused any gains to be marginal at best, and substantial regressions were seen almost across the board.

I don't intend on adding any new patterns to this patch, further patterns (and AArch64 support) will be implemented in follow-up patches.

Herald added a project: Restricted Project. · View Herald Transcript · Mar 21 2022, 6:54 AM

OK - we can keep this going - we know it should give some decent performance. We can see where we go from there if something else materializes that can give the same gains.

At a high level - what this pass is doing either seems to be too complicated or too simple. What it effectively seems to do is pattern match shuffle(fadd/fmul/etc(shuffle, shuffle)) in a way that matches one of several patterns, and generates target intrinsics from that. If that is all that is needed then the "Graph" that gets created doesn't seem necessary. It only ever stores a single node, so isn't much of a graph. It is more complicated than it needs to be.

But I imagine in practice that a lot of math routines will do more than a single complex multiply. There will be multiple parts; this should be building a graph that is made up of several nodes, a bit at a time, and generating code from it only if it can create a full graph that looks profitable to generate.

There are a number of other comments below.

llvm/include/llvm/CodeGen/ComplexArithmetic.h
1 ↗(On Diff #416915)

ComplexArithmetic is a bit of a general name for what this pass is trying to achieve. What about something like ComplexDeinterleavingPass? It better explains that the pass is performing the deinterleaving from shuffles.

20 ↗(On Diff #416915)

Is this needed? It's strange to have a macro like this, and I'm not sure it's very useful in the places it is used.

59 ↗(On Diff #416915)

It doesn't seem that it would be necessary to combine ComplexArithmeticOperation operations together. It doesn't make much sense to have something that is both an Add _and_ a Partial Mul.

61 ↗(On Diff #416915)

Is this just needed to pass data between the pass and the TLI? It's probably simpler (and more like the existing methods) to just pass the parameters individually.

llvm/include/llvm/CodeGen/Passes.h
85 ↗(On Diff #416915)

I think this line can be removed. The other comments here are /// doxygen comments.

llvm/include/llvm/CodeGen/TargetLowering.h
2963 ↗(On Diff #416915)

This isn't needed if the target adds the pass itself. There will probably need to be a way to specify _which_ patterns the target supports for a given type. For example MVE has both integer and floating point complex operations. If the subtarget has only MVE (not MVE.fp), then it needs to support the integer complex operations, without supporting floating point. Other differences could exist like one architecture supporting a different subset of operations.

2969 ↗(On Diff #416915)

Is GeneratedIntrinsicCount just for statistics? If so it doesn't seem to be worth complicating the interface for.

llvm/lib/CodeGen/ComplexArithmetic.cpp
40 ↗(On Diff #416915)

This leaks memory. Same for the ones below. They can probably return a SmallVector<int>.

107 ↗(On Diff #416915)

Why are Real and Imaginary needed?

110 ↗(On Diff #416915)

Loads and Stores don't appear in the graph.

112 ↗(On Diff #416915)

Why is AddOperand needed? What does it mean?

113–114 ↗(On Diff #416915)

How is Preserve different to Input?

116 ↗(On Diff #416915)

What does this refer to?

119–120 ↗(On Diff #416915)

What does this mean and refer to?

321 ↗(On Diff #416915)

This looks like debugging code that should be removed. There is already an option to disable it.

660–661 ↗(On Diff #416915)

Remove commented out code.

703 ↗(On Diff #416915)

This shouldn't be something that is hard-coded in the pass. It might make more sense to push the splitting up of larger types to the target. There are cases like fixed-length SVE where the width won't be limited to 128bit vectors, and pushing that out to the target will be more general.

804 ↗(On Diff #416915)

This logic is all a bit strange. Why have Changed?

for (auto &I : *B) {
  if (auto *SVI = dyn_cast<ShuffleVectorInst>(&I))
    if (isInterleaving(SVI))
      if (traverseAndPopulateGraph(TLI, SVI, Graph))
        Substituted |= substituteGraph(TLI, &I, Graph, DeadInsts);
}
820 ↗(On Diff #416915)

I think this is just RecursivelyDeleteTriviallyDeadInstructions

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
15

These are probably unneeded?

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-add.ll
2 ↗(On Diff #416915)

I would use run lines that are similar to the llvm/test/CodeGen/Thumb2/mve-xyz.ll tests. Something like:

; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs %s -o - | FileCheck %s

It's best not to use a specific cpu, using the arch instead.

The tests can probably go in the same place too, under llvm/test/CodeGen/Thumb2/mve-xyz.ll for mve tests.

5 ↗(On Diff #416915)

This probably isn't worth testing if it is testing codegen already.

33 ↗(On Diff #416915)

A lot of these tests are strange - they seem to have infinite loops?

I think you should be able to remove most of it to make a much cleaner test. It doesn't even need the loads/stores:

define <8 x half> @complex_add_v8f16(<8 x half> %a, <8 x half> %b) {
entry:
  %strided.vec = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %strided.vec21 = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %strided.vec23 = shufflevector <8 x half> %b, <8 x half> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %strided.vec24 = shufflevector <8 x half> %b, <8 x half> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %0 = fsub fast <4 x half> %strided.vec23, %strided.vec21
  %1 = fadd fast <4 x half> %strided.vec24, %strided.vec
  %interleaved.vec = shufflevector <4 x half> %0, <4 x half> %1, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
  ret <8 x half> %interleaved.vec
}
NickGuy updated this revision to Diff 421520. Apr 8 2022, 7:24 AM
NickGuy marked 22 inline comments as done.
NickGuy edited the summary of this revision. (Show Details)
NickGuy added inline comments.
llvm/include/llvm/CodeGen/ComplexArithmetic.h
20 ↗(On Diff #416915)

It was more useful (though not necessarily needed) in an earlier iteration. I've removed it now.

59 ↗(On Diff #416915)

As above, useful in an earlier iteration. Also removed

llvm/lib/CodeGen/ComplexArithmetic.cpp
40 ↗(On Diff #416915)

Good catch, fixed.

119–120 ↗(On Diff #416915)

This entire enum is no longer required, so I've removed it all.

703 ↗(On Diff #416915)

I've pushed the 128 part to be provided by the target, but couldn't figure out a nice way to do the splitting entirely on the target. I'll be revisiting this when I have concrete examples of how scalable vectors will work.

804 ↗(On Diff #416915)

Holdover from an early iteration, where I tried to build the whole graph before changing anything. Simplified

820 ↗(On Diff #416915)

Looks to do the job, thanks

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-add.ll
33 ↗(On Diff #416915)

These were initially generated from a larger IR, and pushed through llvm-reduce. I've rewritten them manually to be much smaller (basing them off of your snippet).

mgabka added a subscriber: mgabka. Apr 14 2022, 3:25 AM
huntergr added inline comments. May 30 2022, 2:03 AM
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
482 ↗(On Diff #421520)

I can't find any place where rotations are dealt with besides assignment and equality checks of 0,90,180,270 -- might an enum be preferable? Or do you anticipate doing arithmetic with them in a follow-up patch?

690 ↗(On Diff #421520)

I think it would be worthwhile to add a comment about what the interleaving represents -- that you're looking for a shuffle that takes separate vectors of real and imaginary parts and combines them before they are stored to memory (or returned in registers), and that this is just for matching per-lane operations instead of cross-lane (like a reduction).

Or at least that's the behaviour I've observed when testing your patch with a loop like the following:

#define LEN (512)
float _Complex a[ LEN ];
float _Complex b[ LEN ];
float _Complex c[ LEN ];

void foo (void) {
#pragma clang loop vectorize(enable)
  for (int i = 0; i < LEN; ++i)
    a[i] = b[i] * c[i];
}
NickGuy updated this revision to Diff 438664. Tue, Jun 21, 5:55 AM
NickGuy marked 2 inline comments as done.

Since the last patch, I've redesigned the high-level approach; it now crawls through the instruction graph to find opportunities for complex deinterleaving before attempting to perform said deinterleaving. Doing it this way allows us to short-circuit, bailing out and preventing the heavy work from being performed if something isn't supported.
This iteration also implements support for complex operations with multiple steps/operands (e.g. a[i] * b[i] * c[i])

I haven't really looked too deeply into the meat of the pass yet - how it does the matching - but I had a chunk of comments for the rest.

What happened to the ideas of starting from a shuffle and working through a worklist of pairs of [Real, Imaginary] Values that we match to complex nodes? It would build up a graph of complex nodes to replace the original shuffle, providing that the leaves were all valid de-interleavings. At least on paper I feel we should just be able to perform the search without looking through a lot of uses (except for checking there are no extra uses, of course), and all the Unmatched instructions and what looks like matching of random unrelated snippets would be cleaned up.

llvm/lib/CodeGen/CMakeLists.txt
49 ↗(On Diff #438664)

Formatting is off here

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
1–2 ↗(On Diff #438664)

This looks like it got formatted incorrectly.

99 ↗(On Diff #438664)

item -> Item

100 ↗(On Diff #438664)

i -> I (or a better name, if I is already used)

149 ↗(On Diff #438664)

Reciprocal Throughput is more common.

511 ↗(On Diff #438664)

evaluateComplexDeinterleavingBasicBlock -> evaluateBasicBlock

991–992 ↗(On Diff #438664)

It is better to structure things in a way where we decide whether to do something, then do it. Not get half way through doing it and decide we didn't want to in the end. In what ways would we expect createComplexDeinterleavingIR to return nullptr at the moment?

1017 ↗(On Diff #438664)

Is this cost necessary at the moment, or will it always be profitable for the simple cases?

llvm/lib/Target/ARM/ARMISelLowering.cpp
21738–21742 ↗(On Diff #438664)

Things like this can be a single if:

if (!VTy || VTy->getNumElements() * VTy->getScalarSizeInBits() != 128)
    return false;

I presume the == 128 can be removed if we teach it how to split the vectors up?

21773 ↗(On Diff #438664)

Can this be an assert instead?

llvm/lib/Target/ARM/ARMTargetMachine.cpp
5–1

This doesn't need to add the new brackets.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-mul.ll
32 ↗(On Diff #421520)

Add arm_aapcs_vfpcc to any tests that take or return vectors.
I think you can remove #0 too.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f64-mul.ll
2 ↗(On Diff #438664)

It might be a little cleaner to add +fp64 for all these f64 tests.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-rotations-add.ll
1 ↗(On Diff #421520)

MVE tests can go into llvm/test/CodeGen/Thumb2/mve-complex-xyz.ll. So long as they are all updated by the test script, that should be fine to keep them with the rest.

NickGuy added a comment (edited). Wed, Jun 22, 4:02 AM

What happened to the ideas of starting from a shuffle and working through a worklist of pairs of [Real, Imaginary] Values that we match to complex nodes?

While that would work "simply" enough for cases like a * b, more elaborate cases (e.g. (a * b) * c) would result in some ambiguity as to which add(mul, mul) pairs with which sub(mul, mul). This complexity inflates further the more operands are involved. The approach I've implemented here considers each complex node in isolation (with the exception of the accumulator value).
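To make the ambiguity concrete, expanding (a * b) * c with p = a * b gives two structurally identical sub(mul, mul)/add(mul, mul) pairs (a sketch of the standard expansion, not notation from the patch):

```latex
\begin{align*}
p_r &= a_r b_r - a_i b_i, & p_i &= a_r b_i + a_i b_r,\\
\big((ab)\,c\big)_r &= p_r c_r - p_i c_i, & \big((ab)\,c\big)_i &= p_r c_i + p_i c_r.
\end{align*}
```

A matcher walking back from the final interleaving shuffle sees both fsub/fadd pairs at once and has to decide which fsub belongs with which fadd.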

I presume the == 128 can be removed if we teach it how to split the vectors up?

Yep, vector splitting is something I decided to push out of this initial patch, and will be implemented in a follow-up. (Due to each node being treated in isolation, the vector splitting from previous iterations got overly complicated and unwieldy). The ideal solution that I can see would be to teach the intrinsic how to split, rather than the pass (somewhere like DAGTypeLegalizer::SplitVectorResult).

Is this cost necessary at the moment, or will it always be profitable for the simple cases?

Complex add has a few cases where I've observed performance regressions, and that's the sort of thing this rudimentary cost-modelling is intended to catch.

Remaining comments will be addressed in a follow-up patch.

chill added a subscriber: chill. Tue, Jun 28, 3:28 AM
chill added inline comments.
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
136 ↗(On Diff #438664)

Here and elsewhere std::find -> llvm::find

593 ↗(On Diff #438664)

Why are we scanning the whole block?
It looks to me (admittedly I don't understand what this pass does yet)
we can just walk over the Instructions vector (in reverse, if the order matters),
avoiding the quadratic complexity.

633 ↗(On Diff #438664)

Here and in a few other places for (Value *V : Op0->operands()) {

chill added a comment. Tue, Jun 28, 3:30 AM

Do we have some broad overview of the approach and the algorithm? It'd be a good idea to put something like this in the description and eventually the commit message. (I searched for LLVM complex RFCs, but couldn't find anything useful.)

chill added inline comments. Tue, Jun 28, 7:42 AM
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
325 ↗(On Diff #438664)

return std::make_shared<ComplexDeinterleavingCompositeNode>(Operation);

561 ↗(On Diff #438664)

These could be, e.g.:

static const int RealMask[] = {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30};
auto RealMaskRef = ArrayRef<int>(RealMask, ShuffleMask.size());

with an assertion/bounds check. Good enough for 512-bit vectors with 16-bit elements, can be extended.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99 ↗(On Diff #438664)

Shouldn't these be translated to a couple of vcadd.f32 instructions, like in the previous test?
And this amount of move instructions seems excessive.

chill added inline comments. Tue, Jun 28, 7:46 AM
llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99 ↗(On Diff #438664)

And this amount of move instructions seems excessive.

I guess MVE does not have sensible swizzling instructions.

What happened to the ideas of starting from a shuffle and working through a worklist of pairs of [Real, Imaginary] Values that we match to complex nodes?

While that would work "simply" enough for cases like a * b, more elaborate cases (e.g. (a * b) * c) would result in some ambiguity as to which add(mul, mul) pairs with which sub(mul, mul). This complexity inflates further the more operands are involved. The approach I've implemented here considers each complex node in isolation (with the exception of the accumulator value).

Hmm, OK. I was thinking about a three-way multiply when looking at that way of structuring the pass, but only one pattern and only ever on paper. I was worried that there were cases where it was ambiguous, but figured if so it could always just try both possibilities. But I've not implemented it; it just sounded like an elegant way of treating this as a slightly more complex pattern matcher, as opposed to all this findUnmatchedInstructions and looking through uses.

A broad overview of the algorithm would be useful, like Momchil mentioned, perhaps in the file description. I was trying to make a sketch before but got a bit lost in nested loops.

I presume the == 128 can be removed if we teach it how to split the vectors up?

Yep, vector splitting is something I decided to push out of this initial patch, and will be implemented in a follow-up. (Due to each node being treated in isolation, the vector splitting from previous iterations got overly complicated and unwieldy). The ideal solution that I can see would be to teach the intrinsic how to split, rather than the pass (somewhere like DAGTypeLegalizer::SplitVectorResult).

A separate addition sounds good. Lets try and get something that we happy with and extend it out.

Is this cost necessary at the moment, or will it always be profitable for the simple cases?

Complex add has a few cases where I've observed performance regressions, and that's the sort of thing this rudimentary cost-modelling is intended to catch.

Do you have a test-case where this happens? Cost modelling will probably be good to add in the long run, I'd just like to understand where it currently can make things worse.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99 ↗(On Diff #438664)

Yeah this is expected from shuffles that MVE cannot handle very well. It would look a lot better either with loads that could become interleaving loads, or under AArch64 where better shuffles are available.

NickGuy updated this revision to Diff 442045. Mon, Jul 4, 2:55 AM
NickGuy marked 6 inline comments as done.
NickGuy retitled this revision from [ARM][CodeGen] Add support for complex addition and multiplication to [ARM][CodeGen] Add support for complex deinterleaving.

Addressed comments, and added a file description that attempts to explain what the pass does.

NickGuy marked 15 inline comments as done. Mon, Jul 4, 3:42 AM
NickGuy added inline comments.
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
593 ↗(On Diff #438664)

The order of the CompositeNodes is important, this was a holdover from before we were sorting the nodes.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99 ↗(On Diff #438664)

Shouldn't these be translated to a couple of vcadd.f32 instructions, like in the previous test?

Not in this case, no. The IR vectors are too wide to fit into actual vector registers, and the follow-up vector splitting patch (D129067) is restricted to splitting in half.