Page MenuHomePhabricator

[ARM][CodeGen] Add support for complex deinterleaving
ClosedPublic

Authored by NickGuy on Nov 18 2021, 10:29 AM.

Details

Summary

Adds the Complex Deinterleaving Pass implementing support for complex numbers in a target-independent manner, deferring to the TargetLowering for the given target to create a target-specific intrinsic.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
NickGuy added inline comments.Sep 12 2022, 5:58 AM
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
590

Attempting to plumb something together that acceps both ComplexDeinterleavingCompositeNode* and Value* caused more bloat and made things more difficult to follow. And we can't simply use Node->ReplacementNode as the Value* because it hasn't been assigned yet.
Because of that constraint, I'm opting to find the nodes later, though I'm open to ideas.

dmgreen added inline comments.Sep 13 2022, 2:22 AM
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
169

Why is this needed? When are the Real and Imag values not already correct? It seems strange to be able to reverse them.

438

What about the other pair? They are not guaranteed to be the same as the operands from identifyPartialMul as far as I can see. I would probably make sure that the two operands are a valid match too. The second call to identifyNode with the same operands should just find the same node again. And that way this can have proper operands.

Currently this and identifyPartialMul are very much interlinked. I hope we can improve that in the future, but it sounds like an extension that can be thought about later.

590

I was expecting it to just accept ComplexDeinterleavingCompositeNode. If all the operands are ComplexDeinterleavingCompositeNode, then they just need to be visited in the correct order to assign ReplacementNode based on how they need to be transformed. (Which I believe is fine providing we visit them in reverse order, like is already done).

i.e. we construct a graph made up of ComplexDeinterleavingCompositeNode, then transform that graph. That seems like a simpler, more extensible design going forward. It gets more complicated if there is a mixing of values between the original IR and the newly constructed intrinsics. I could imagine that might make graphs (as opposed to DAGs) more difficult, but we don't yet support any cycles.

724–732

Sorry - I'm not sure what this was referring to now! Do you think that example should be transforming into complex intrinsics? It doesn't look valid to me.

Have you added it as a test case?

737–738

Can we move this into identifyNodes (or maybe at this point there will always be nodes?)

That would make replaceNodes always return true, which simplifies the getDeadRoots a little too.

NickGuy updated this revision to Diff 461829.Sep 21 2022, 2:51 AM
NickGuy marked 4 inline comments as done.

Addressed comments, and redesigned the lookup list of nodes to now be represented as a graph.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
169

It was a holdover function that I reused for the .Valid check. It's not used anywhere else so I've removed it (and the relevant struct), and moved the relevant check to here.

590

I've converted the pseudo-graph (a list with lookups) to an actual graph structure, with every node being a ComplexDeinterleavingCompositeNode*, meaning that we're no longer mixing Value*s and ComplexDeinterleavingCompositeNode*s.

Have you had any luck adding a larger corpus of tests cases yet? I see that uses have been added to the graph. I would hope that we could avoid that complication, but it is hard to see if we can or not without a larger set of tests.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
762

Try not to overuse auto where the type isn't already obvious.

NickGuy updated this revision to Diff 463541.Sep 28 2022, 7:03 AM
NickGuy marked an inline comment as done and 2 inline comments as not done.

Added more interesting test cases, covering triangle and diamond multiply cases.

Thanks for adding all the extra tests. Here is another one, that is useful in itself, but more useful when we break it. It does a*b +90 a*c, and seems to work well for this example.

define arm_aapcs_vfpcc <4 x float> @test(<4 x float> %a, <4 x float> %b, <4 x float> %c) {
entry:
  %ar = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %ai = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %br = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %bi = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %cr = shufflevector <4 x float> %c, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %ci = shufflevector <4 x float> %c, <4 x float> poison, <2 x i32> <i32 1, i32 3>

  %i6 = fmul fast <2 x float> %br, %ar
  %i7 = fmul fast <2 x float> %bi, %ai
  %xr = fsub fast <2 x float> %i6, %i7
  %i9 = fmul fast <2 x float> %bi, %ar
  %i10 = fmul fast <2 x float> %br, %ai
  %xi = fadd fast <2 x float> %i9, %i10

  %j6 = fmul fast <2 x float> %cr, %ar
  %j7 = fmul fast <2 x float> %ci, %ai
  %yr = fsub fast <2 x float> %j6, %j7
  %j9 = fmul fast <2 x float> %ci, %ar
  %j10 = fmul fast <2 x float> %cr, %ai
  %yi = fadd fast <2 x float> %j9, %j10

  %zr = fsub fast <2 x float> %yr, %xi
  %zi = fadd fast <2 x float> %yi, %xr
  %interleaved.vec = shufflevector <2 x float> %zr, <2 x float> %zi, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
  ret <4 x float> %interleaved.vec
}

But this is a modification that alters the c*a to use part of the b*a mul's. It shouldn't be being transformed as it is, I don't believe.

define arm_aapcs_vfpcc <4 x float> @mul_triangle_addmul(<4 x float> %a, <4 x float> %b, <4 x float> %c) {
entry:
  %ar = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %ai = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %br = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %bi = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %cr = shufflevector <4 x float> %c, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %ci = shufflevector <4 x float> %c, <4 x float> poison, <2 x i32> <i32 1, i32 3>

  %i6 = fmul fast <2 x float> %br, %ar
  %i7 = fmul fast <2 x float> %bi, %ai
  %xr = fsub fast <2 x float> %i6, %i7
  %i9 = fmul fast <2 x float> %bi, %ar
  %i10 = fmul fast <2 x float> %br, %ai
  %xi = fadd fast <2 x float> %i9, %i10

  ;%j6 = fmul fast <2 x float> %cr, %ar
  %j7 = fmul fast <2 x float> %ci, %ai
  %yr = fsub fast <2 x float> %i6, %j7
  ;%j9 = fmul fast <2 x float> %ci, %ar
  %j10 = fmul fast <2 x float> %cr, %ai
  %yi = fadd fast <2 x float> %i9, %j10

  %zr = fsub fast <2 x float> %yr, %xi
  %zi = fadd fast <2 x float> %yi, %xr
  %interleaved.vec = shufflevector <2 x float> %zr, <2 x float> %zi, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
  ret <4 x float> %interleaved.vec
}

The Incomplete nodes worry me as it looks like a rich source of bugs. If the identifyPartialMul and identifyNodeWithImplicitAdd need to work together more closely for the time being, that is probably fine. We can always change it in the future if needed.

Some other issues are that we need to check for multiple uses. As in something like

define arm_aapcs_vfpcc <4 x float> @mul_triangle_multiuses(<4 x float> %a, <4 x float> %b, ptr %p) {
; CHECK-LABEL: mul_triangle_multiuses:
; CHECK:       @ %bb.0: @ %entry
; CHECK-NEXT:    .vsave {d14}
; CHECK-NEXT:    vpush {d14}
; CHECK-NEXT:    .vsave {d10, d11, d12}
; CHECK-NEXT:    vpush {d10, d11, d12}
; CHECK-NEXT:    .vsave {d8}
; CHECK-NEXT:    vpush {d8}
; CHECK-NEXT:    vmov q2, q0
; CHECK-NEXT:    vmov.f32 s16, s4
; CHECK-NEXT:    vmov.f32 s17, s6
; CHECK-NEXT:    vmov.i32 q0, #0x0
; CHECK-NEXT:    vmov.f32 s20, s9
; CHECK-NEXT:    vmov.f32 s21, s11
; CHECK-NEXT:    vmov.f32 s28, s5
; CHECK-NEXT:    vmul.f32 q3, q5, q4
; CHECK-NEXT:    vmov.f32 s29, s7
; CHECK-NEXT:    vmul.f32 q5, q7, q5
; CHECK-NEXT:    vmov.f32 s24, s8
; CHECK-NEXT:    vmov.f32 s25, s10
; CHECK-NEXT:    vneg.f32 q5, q5
; CHECK-NEXT:    vfma.f32 q3, q7, q6
; CHECK-NEXT:    vfma.f32 q5, q4, q6
; CHECK-NEXT:    vmov.f32 s22, s12
; CHECK-NEXT:    vmov.f32 s23, s13
; CHECK-NEXT:    vmov q3, q0
; CHECK-NEXT:    vcmla.f32 q3, q1, q2, #0
; CHECK-NEXT:    vstrw.32 q5, [r0]
; CHECK-NEXT:    vcmla.f32 q3, q1, q2, #90
; CHECK-NEXT:    vcmla.f32 q0, q2, q3, #0
; CHECK-NEXT:    vcmla.f32 q0, q2, q3, #90
; CHECK-NEXT:    vpop {d8}
; CHECK-NEXT:    vpop {d10, d11, d12}
; CHECK-NEXT:    vpop {d14}
; CHECK-NEXT:    bx lr
entry:
  %strided.vec = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %strided.vec35 = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %strided.vec37 = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %strided.vec38 = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %0 = fmul fast <2 x float> %strided.vec37, %strided.vec
  %1 = fmul fast <2 x float> %strided.vec38, %strided.vec35
  %2 = fsub fast <2 x float> %0, %1
  %3 = fmul fast <2 x float> %2, %strided.vec35
  %4 = fmul fast <2 x float> %strided.vec38, %strided.vec
  %5 = fmul fast <2 x float> %strided.vec35, %strided.vec37
  %6 = fadd fast <2 x float> %4, %5
  %otheruse = shufflevector <2 x float> %2, <2 x float> %6, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  store <4 x float> %otheruse, ptr %p
  %7 = fmul fast <2 x float> %6, %strided.vec
  %8 = fadd fast <2 x float> %3, %7
  %9 = fmul fast <2 x float> %2, %strided.vec
  %10 = fmul fast <2 x float> %6, %strided.vec35
  %11 = fsub fast <2 x float> %9, %10
  %interleaved.vec = shufflevector <2 x float> %11, <2 x float> %8, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
  ret <4 x float> %interleaved.vec
}

And we probably need to check for fast-math flags, where we are generating fma. I also think that it is safer if nodes are uniquely identified from {real, imag} root pairs, not just from nodes that might contain either real or imag somewhere in them.

llvm/test/CodeGen/Thumb2/complex-deinterleaving-uniform-cases.ll
1 ↗(On Diff #463541)

All the mve tests start with mve-

NickGuy updated this revision to Diff 467159.Oct 12 2022, 8:38 AM
NickGuy marked an inline comment as done.

Thanks @dmgreen for the extra tests.

...
The Incomplete nodes worry me as it looks like a rich source of bugs. If the identifyPartialMul and identifyNodeWithImplicitAdd need to work together more closely for the time being, that is probably fine. We can always change it in the future if needed.
...

I've removed the late lookup/replacement of incomplete nodes, and moved it closer to where the incomplete nodes are created. I've also made it only check against a single node, one that we know should match

...
And we probably need to check for fast-math flags, where we are generating fma. I also think that it is safer if nodes are uniquely identified from {real, imag} root pairs, not just from nodes that might contain either real or imag somewhere in them.
...

Done, we're now checking the fast-math flags (specifically contract)

dmgreen added inline comments.Oct 14 2022, 10:54 AM
llvm/include/llvm/CodeGen/ComplexDeinterleavingPass.h
37

I don't think None is ever used, and from what I can see _Placeholder is a Shuffle? That might be a better name, either Shuffle or Leaf. I would also probably remove _Incomplete. They might make sense, but it is difficult to follow and very easy to get wrong.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
106

I think it would be best to store the root Real and Imag values as part of the ComplexDeinterleavingCompositeNode. That way we can be sure we are matching a correct node from anything like getContainingComposite.
There can then be the TransientInstructions (maybe "ExtraInstructions"), which are also part of the pattern, but not the root nodes.

115

Could AccumulatorNode just be another Operand?

116

I think this would be best to avoid NodePtr, if that might keep a reference to a node alive. Avoiding shared_ptr entirely would be good, but I imagine any alternative might be difficult too. They needs to be stable values to avoid memory invalidation issues.

171

Formatting needs updating.

220

If NodeInstructions is used to calculate all the Instructions we have seen, to check that all the uses are inside the graph, it could be a SmallPtrSet which will have quicker contains().

288

It is hard to see how wrapping a Value as an _Incomplete node is always correct. It will have lost the information about whether V was real or imaginary.

576

Do you have any tests for fneg? This should probably be more like the code in identifyPartialMul where the rotation is chosen based on whether the Real/Imag parts are negated.

673

It's not the combination of rotates that is invalid really, although some might make less sense than others, and some might simplify to code that is difficult to match. The even rotations will define the real part of a, the odd rotations define the imag part. Between the two parts of the multiply we match we need to find both the real and imag halves of the value, to successfully match it further. (We may be able to do something with only half a match in some cases, like matching leafs, but that should probably be left for a later patch).

llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-mixed-cases.ll
79

Are you sure this is valid?
When we were doing the gather lowering, we found it useful to annotate the tests with OK/Bad, so if they change we could see the ones we thought shouldn't.

NickGuy updated this revision to Diff 469583.Oct 21 2022, 6:47 AM
NickGuy marked 10 inline comments as done.

Addressed comments

Thanks for the updates. This is looking good now.

There are some extra comments below. I was trying some examples that have awkward orders and cross basicblocks, but couldn't find ways to make it break.

llvm/include/llvm/CodeGen/ComplexDeinterleavingPass.h
40–43

This can now be re-flowed.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
20

Can this comment be updated now?

560

These can be // comments. They are very useful, but maybe it is not necessary to repeat them.

llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-mixed-cases.ll
4

Are there any tests with missing fast-math flags?

5

I managed to come up with this testcase that failed because of the vector size. It is probably larger than it needs to be. It could work or not, depending if the backend handles non-power-2 sizes. In either rate, it would be good to add this example. (Once integers are added too, some odd size integer width tests would be good too).

define arm_aapcs_vfpcc <12 x float> @abp90c12(<12 x float> %a, <12 x float> %b, <12 x float> %c) {
entry:
  %ar = shufflevector <12 x float> %a, <12 x float> poison, <6 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10>
  %ai = shufflevector <12 x float> %a, <12 x float> poison, <6 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11>
  %br = shufflevector <12 x float> %b, <12 x float> poison, <6 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10>
  %bi = shufflevector <12 x float> %b, <12 x float> poison, <6 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11>
  %cr = shufflevector <12 x float> %c, <12 x float> poison, <6 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10>
  %ci = shufflevector <12 x float> %c, <12 x float> poison, <6 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11>

  %i6 = fmul fast <6 x float> %br, %ar
  %i7 = fmul fast <6 x float> %bi, %ai
  %xr = fsub fast <6 x float> %i6, %i7
  %i9 = fmul fast <6 x float> %bi, %ar
  %i10 = fmul fast <6 x float> %br, %ai
  %xi = fadd fast <6 x float> %i9, %i10

  %zr = fsub fast <6 x float> %cr, %xi
  %zi = fadd fast <6 x float> %ci, %xr
  %interleaved.vec = shufflevector <6 x float> %zr, <6 x float> %zi, <12 x i32> <i32 0, i32 6, i32 1, i32 7, i32 2, i32 8, i32 3, i32 9, i32 4, i32 10, i32 5, i32 11>
  ret <12 x float> %interleaved.vec
}
llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-uniform-cases.ll
79

Is this just not transforming because of commutativity on the add? Can we add a test for fadd fast <2 x float> %strided.vec20, %strided.vec too, to show it does transform.

SjoerdMeijer added a comment.EditedOct 26 2022, 3:14 AM

Came here to say that this looks nice and would be worthy a mention in the release notes, so that can be added, I think.

Then I started reading the patch again a bit and couldn't help myself writing down some nits.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
17

Nit: perhaps just omit in parallel? Or if you want to keep it, it could benefit from a clarification.

26

I wouldn't mind a few more words/sentences about the internal datastructures, the graph, nodes, etc.

51

Nit: remove deinterleaving from this description?

60

Nit: newline

110

Can you say what these instructions could be?

174

I haven't looked carefully, but why do we also need to record all instructions here? I mean, the graph can be queried for existence of nodes/instructions, or is this bookkeeping duplication more efficient?

182

Nit: no validation going on here, only insertion?

182

Nit: no validation going on here, only insertion?

240

Nit: newline?

313

Nit: can omit the comparison with 1.

344

Nit: I am a fan of reducing indentation, e.g.:

auto *SVI = dyn_cast<ShuffleVectorInst>(&I);
if (!SVI)
  continue;
if (!isInterleavingMask(SVI->getShuffleMask())
  continue;

etc.

368

Nit: maybe I overlooked something, but this doesn't seem to use any state from the class so could be just a helper function?

468

Nit: I am wondering if these casts here and below are necessary.

484

nit: maybe some or all of these node setting things can be done in a constructor.

540

Nits:
Op -> Operand,
not Instruction -> not a Instruction?

NickGuy updated this revision to Diff 473955.Nov 8 2022, 4:10 AM
NickGuy marked 23 inline comments as done.
NickGuy added inline comments.
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
174

The duplication makes it simpler to check the internal uses of a node (See ComplexDeinterleavingCompositeNode::hasAllInternalUses). Without it, we'd need to iterate over the nodes twice, and check the instructions of each node (All of Real, Imag, and the contents of InternalInstructions), while this approach allows us to simply say "Is this instruction known within the graph?".

182

Good catch, the validation that was here was moved to be more tied into the identification. Renamed

368

It populates the compositeNode structure through (formerly) validateAndSubmitCompositeNode, and also calls identifyNode further, which does use state from the class.

468

You are correct, good catch.

484

Not sure I agree with that, there are cases - such as a Shuffle - where none of the other settings are relevant. The operation type, real instruction, and imaginary instructions are the only settings relevant to all cases.

llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-mixed-cases.ll
5

Fixed this case (by checking on the backend whether the vector width is a power of 2), and added the test.

llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-uniform-cases.ll
79

Yep, it's the commutativity. I've added a note to this test stating this, and added another test with the fadd operands reversed.

dmgreen added inline comments.Nov 10 2022, 9:19 AM
llvm/include/llvm/InitializePasses.h
101

You can undo these

111

There is a newline here?

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
413

If you can turn some of these into better sentances that would be good. Perhaps "Mul instruction has multiple uses" like you have below.

482

Formatting is a bit long.

llvm/lib/Target/ARM/ARMTargetMachine.cpp
437

This again.

llvm/test/CodeGen/ARM/O3-pipeline.ll
2

This looks like it was incorrectly added for this test.

khchen added a subscriber: khchen.Nov 10 2022, 10:06 AM
NickGuy updated this revision to Diff 474743.Nov 11 2022, 6:39 AM
NickGuy marked 6 inline comments as done.
dmgreen accepted this revision.Nov 14 2022, 5:39 AM

OK great. Thanks for the updates. I think this looks to be in a good state now. We can always do more nitpicking, and but it looks to be in a good state to get into tree. We can get this in now and add AArch64 support on top to increase the test coverage.

Some minor comments below but otherwise LGTM.

llvm/include/llvm/InitializePasses.h
108

You can undo these.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
591

Formatting.

llvm/lib/Target/ARM/ARMISelLowering.cpp
21860

Formatting.

21868

Add a message to the assert

This revision is now accepted and ready to land.Nov 14 2022, 5:39 AM
This revision was landed with ongoing or failed builds.Nov 14 2022, 6:04 AM
This revision was automatically updated to reflect the committed changes.
NickGuy marked 4 inline comments as done.

OK great. Thanks for the updates. I think this looks to be in a good state now. We can always do more nitpicking, and but it looks to be in a good state to get into tree. We can get this in now and add AArch64 support on top to increase the test coverage.

Some minor comments below but otherwise LGTM.

Agreed, and thanks for working on this.

Question about AArch64, is that something you are going to work on next?

OK great. Thanks for the updates. I think this looks to be in a good state now. We can always do more nitpicking, and but it looks to be in a good state to get into tree. We can get this in now and add AArch64 support on top to increase the test coverage.

Some minor comments below but otherwise LGTM.

Agreed, and thanks for working on this.

Question about AArch64, is that something you are going to work on next?

Yep, over at D129066. It's currently slightly behind this, but is mostly there already.