Adds the Complex Deinterleaving Pass implementing support for complex numbers in a target-independent manner, deferring to the TargetLowering for the given target to create a target-specific intrinsic.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
---|---|---|
590 | Attempting to plumb something together that acceps both ComplexDeinterleavingCompositeNode* and Value* caused more bloat and made things more difficult to follow. And we can't simply use Node->ReplacementNode as the Value* because it hasn't been assigned yet. |
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
---|---|---|
169 | Why is this needed? When are the Real and Imag values not already correct? It seems strange to be able to reverse them. | |
438 | What about the other pair? They are not guaranteed to be the same as the operands from identifyPartialMul as far as I can see. I would probably make sure that the two operands are a valid match too. The second call to identifyNode with the same operands should just find the same node again. And that way this can have proper operands. Currently this and identifyPartialMul are very much interlinked. I hope we can improve that in the future, but it sounds like an extension that can be thought about later. | |
590 | I was expecting it to just accept ComplexDeinterleavingCompositeNode. If all the operands are ComplexDeinterleavingCompositeNode, then they just need to be visited in the correct order to assign ReplacementNode based on how they need to be transformed. (Which I believe is fine providing we visit them in reverse order, like is already done). i.e. we construct a graph made up of ComplexDeinterleavingCompositeNode, then transform that graph. That seems like a simpler, more extensible design going forward. It gets more complicated if there is a mixing of values between the original IR and the newly constructed intrinsics. I could imagine that might make graphs (as opposed to DAGs) more difficult, but we don't yet support any cycles. | |
724–732 | Sorry - I'm not sure what this was referring to now! Do you think that example should be transforming into complex intrinsics? It doesn't look valid to me. Have you added it as a test case? | |
737–738 | Can we move this into identifyNodes (or maybe at this point there will always be nodes?) That would make replaceNodes always return true, which simplifies the getDeadRoots a little too. |
Addressed comments, and redesigned the lookup list of nodes to now be represented as a graph.
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
---|---|---|
169 | It was a holdover function that I reused for the .Valid check. It's not used anywhere else so I've removed it (and the relevant struct), and moved the relevant check to here. | |
590 | I've converted the pseudo-graph (a list with lookups) to an actual graph structure, with every node being a ComplexDeinterleavingCompositeNode*, meaning that we're no longer mixing Value*s and ComplexDeinterleavingCompositeNode*s. |
Have you had any luck adding a larger corpus of tests cases yet? I see that uses have been added to the graph. I would hope that we could avoid that complication, but it is hard to see if we can or not without a larger set of tests.
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
---|---|---|
762 | Try not to overuse auto where the type isn't already obvious. |
Thanks for adding all the extra tests. Here is another one, that is useful in itself, but more useful when we break it. It does a*b +90 a*c, and seems to work well for this example.
define arm_aapcs_vfpcc <4 x float> @test(<4 x float> %a, <4 x float> %b, <4 x float> %c) { entry: %ar = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2> %ai = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3> %br = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2> %bi = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3> %cr = shufflevector <4 x float> %c, <4 x float> poison, <2 x i32> <i32 0, i32 2> %ci = shufflevector <4 x float> %c, <4 x float> poison, <2 x i32> <i32 1, i32 3> %i6 = fmul fast <2 x float> %br, %ar %i7 = fmul fast <2 x float> %bi, %ai %xr = fsub fast <2 x float> %i6, %i7 %i9 = fmul fast <2 x float> %bi, %ar %i10 = fmul fast <2 x float> %br, %ai %xi = fadd fast <2 x float> %i9, %i10 %j6 = fmul fast <2 x float> %cr, %ar %j7 = fmul fast <2 x float> %ci, %ai %yr = fsub fast <2 x float> %j6, %j7 %j9 = fmul fast <2 x float> %ci, %ar %j10 = fmul fast <2 x float> %cr, %ai %yi = fadd fast <2 x float> %j9, %j10 %zr = fsub fast <2 x float> %yr, %xi %zi = fadd fast <2 x float> %yi, %xr %interleaved.vec = shufflevector <2 x float> %zr, <2 x float> %zi, <4 x i32> <i32 0, i32 2, i32 1, i32 3> ret <4 x float> %interleaved.vec }
But this is a modification that alters the c*a to use part of the b*a mul's. It shouldn't be being transformed as it is, I don't believe.
define arm_aapcs_vfpcc <4 x float> @mul_triangle_addmul(<4 x float> %a, <4 x float> %b, <4 x float> %c) { entry: %ar = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2> %ai = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3> %br = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2> %bi = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3> %cr = shufflevector <4 x float> %c, <4 x float> poison, <2 x i32> <i32 0, i32 2> %ci = shufflevector <4 x float> %c, <4 x float> poison, <2 x i32> <i32 1, i32 3> %i6 = fmul fast <2 x float> %br, %ar %i7 = fmul fast <2 x float> %bi, %ai %xr = fsub fast <2 x float> %i6, %i7 %i9 = fmul fast <2 x float> %bi, %ar %i10 = fmul fast <2 x float> %br, %ai %xi = fadd fast <2 x float> %i9, %i10 ;%j6 = fmul fast <2 x float> %cr, %ar %j7 = fmul fast <2 x float> %ci, %ai %yr = fsub fast <2 x float> %i6, %j7 ;%j9 = fmul fast <2 x float> %ci, %ar %j10 = fmul fast <2 x float> %cr, %ai %yi = fadd fast <2 x float> %i9, %j10 %zr = fsub fast <2 x float> %yr, %xi %zi = fadd fast <2 x float> %yi, %xr %interleaved.vec = shufflevector <2 x float> %zr, <2 x float> %zi, <4 x i32> <i32 0, i32 2, i32 1, i32 3> ret <4 x float> %interleaved.vec }
The Incomplete nodes worry me as it looks like a rich source of bugs. If the identifyPartialMul and identifyNodeWithImplicitAdd need to work together more closely for the time being, that is probably fine. We can always change it in the future if needed.
Some other issues are that we need to check for multiple uses. As in something like
define arm_aapcs_vfpcc <4 x float> @mul_triangle_multiuses(<4 x float> %a, <4 x float> %b, ptr %p) { ; CHECK-LABEL: mul_triangle_multiuses: ; CHECK: @ %bb.0: @ %entry ; CHECK-NEXT: .vsave {d14} ; CHECK-NEXT: vpush {d14} ; CHECK-NEXT: .vsave {d10, d11, d12} ; CHECK-NEXT: vpush {d10, d11, d12} ; CHECK-NEXT: .vsave {d8} ; CHECK-NEXT: vpush {d8} ; CHECK-NEXT: vmov q2, q0 ; CHECK-NEXT: vmov.f32 s16, s4 ; CHECK-NEXT: vmov.f32 s17, s6 ; CHECK-NEXT: vmov.i32 q0, #0x0 ; CHECK-NEXT: vmov.f32 s20, s9 ; CHECK-NEXT: vmov.f32 s21, s11 ; CHECK-NEXT: vmov.f32 s28, s5 ; CHECK-NEXT: vmul.f32 q3, q5, q4 ; CHECK-NEXT: vmov.f32 s29, s7 ; CHECK-NEXT: vmul.f32 q5, q7, q5 ; CHECK-NEXT: vmov.f32 s24, s8 ; CHECK-NEXT: vmov.f32 s25, s10 ; CHECK-NEXT: vneg.f32 q5, q5 ; CHECK-NEXT: vfma.f32 q3, q7, q6 ; CHECK-NEXT: vfma.f32 q5, q4, q6 ; CHECK-NEXT: vmov.f32 s22, s12 ; CHECK-NEXT: vmov.f32 s23, s13 ; CHECK-NEXT: vmov q3, q0 ; CHECK-NEXT: vcmla.f32 q3, q1, q2, #0 ; CHECK-NEXT: vstrw.32 q5, [r0] ; CHECK-NEXT: vcmla.f32 q3, q1, q2, #90 ; CHECK-NEXT: vcmla.f32 q0, q2, q3, #0 ; CHECK-NEXT: vcmla.f32 q0, q2, q3, #90 ; CHECK-NEXT: vpop {d8} ; CHECK-NEXT: vpop {d10, d11, d12} ; CHECK-NEXT: vpop {d14} ; CHECK-NEXT: bx lr entry: %strided.vec = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2> %strided.vec35 = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3> %strided.vec37 = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2> %strided.vec38 = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3> %0 = fmul fast <2 x float> %strided.vec37, %strided.vec %1 = fmul fast <2 x float> %strided.vec38, %strided.vec35 %2 = fsub fast <2 x float> %0, %1 %3 = fmul fast <2 x float> %2, %strided.vec35 %4 = fmul fast <2 x float> %strided.vec38, %strided.vec %5 = fmul fast <2 x float> %strided.vec35, %strided.vec37 %6 = fadd fast <2 x float> %4, %5 %otheruse = shufflevector <2 x float> %2, <2 x float> %6, <4 x i32> <i32 0, i32 1, i32 2, i32 3> store <4 x float> %otheruse, ptr %p %7 = fmul fast <2 x float> %6, %strided.vec %8 = fadd fast <2 x float> %3, %7 %9 = fmul fast <2 x float> %2, %strided.vec %10 = fmul fast <2 x float> %6, %strided.vec35 %11 = fsub fast <2 x float> %9, %10 %interleaved.vec = shufflevector <2 x float> %11, <2 x float> %8, <4 x i32> <i32 0, i32 2, i32 1, i32 3> ret <4 x float> %interleaved.vec }
And we probably need to check for fast-math flags, where we are generating fma. I also think that it is safer if nodes are uniquely identified from {real, imag} root pairs, not just from nodes that might contain either real or imag somewhere in them.
llvm/test/CodeGen/Thumb2/complex-deinterleaving-uniform-cases.ll | ||
---|---|---|
1 ↗ | (On Diff #463541) | All the mve tests start with mve- |
Thanks @dmgreen for the extra tests.
I've removed the late lookup/replacement of incomplete nodes, and moved it closer to where the incomplete nodes are created. I've also made it only check against a single node, one that we know should match
...
And we probably need to check for fast-math flags, where we are generating fma. I also think that it is safer if nodes are uniquely identified from {real, imag} root pairs, not just from nodes that might contain either real or imag somewhere in them.
...
Done, we're now checking the fast-math flags (specifically contract)
llvm/include/llvm/CodeGen/ComplexDeinterleavingPass.h | ||
---|---|---|
37 | I don't think None is ever used, and from what I can see _Placeholder is a Shuffle? That might be a better name, either Shuffle or Leaf. I would also probably remove _Incomplete. They might make sense, but it is difficult to follow and very easy to get wrong. | |
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
106 | I think it would be best to store the root Real and Imag values as part of the ComplexDeinterleavingCompositeNode. That way we can be sure we are matching a correct node from anything like getContainingComposite. | |
115 | Could AccumulatorNode just be another Operand? | |
116 | I think this would be best to avoid NodePtr, if that might keep a reference to a node alive. Avoiding shared_ptr entirely would be good, but I imagine any alternative might be difficult too. They needs to be stable values to avoid memory invalidation issues. | |
171 | Formatting needs updating. | |
220 | If NodeInstructions is used to calculate all the Instructions we have seen, to check that all the uses are inside the graph, it could be a SmallPtrSet which will have quicker contains(). | |
288 | It is hard to see how wrapping a Value as an _Incomplete node is always correct. It will have lost the information about whether V was real or imaginary. | |
576 | Do you have any tests for fneg? This should probably be more like the code in identifyPartialMul where the rotation is chosen based on whether the Real/Imag parts are negated. | |
673 | It's not the combination of rotates that is invalid really, although some might make less sense than others, and some might simplify to code that is difficult to match. The even rotations will define the real part of a, the odd rotations define the imag part. Between the two parts of the multiply we match we need to find both the real and imag halves of the value, to successfully match it further. (We may be able to do something with only half a match in some cases, like matching leafs, but that should probably be left for a later patch). | |
llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-mixed-cases.ll | ||
79 | Are you sure this is valid? |
Thanks for the updates. This is looking good now.
There are some extra comments below. I was trying some examples that have awkward orders and cross basicblocks, but couldn't find ways to make it break.
llvm/include/llvm/CodeGen/ComplexDeinterleavingPass.h | ||
---|---|---|
40–43 | This can now be re-flowed. | |
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
20 | Can this comment be updated now? | |
560 | These can be // comments. They are very useful, but maybe it is not necessary to repeat them. | |
llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-mixed-cases.ll | ||
4 | Are there any tests with missing fast-math flags? | |
5 | I managed to come up with this testcase that failed because of the vector size. It is probably larger than it needs to be. It could work or not, depending if the backend handles non-power-2 sizes. In either rate, it would be good to add this example. (Once integers are added too, some odd size integer width tests would be good too). define arm_aapcs_vfpcc <12 x float> @abp90c12(<12 x float> %a, <12 x float> %b, <12 x float> %c) { entry: %ar = shufflevector <12 x float> %a, <12 x float> poison, <6 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10> %ai = shufflevector <12 x float> %a, <12 x float> poison, <6 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11> %br = shufflevector <12 x float> %b, <12 x float> poison, <6 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10> %bi = shufflevector <12 x float> %b, <12 x float> poison, <6 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11> %cr = shufflevector <12 x float> %c, <12 x float> poison, <6 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10> %ci = shufflevector <12 x float> %c, <12 x float> poison, <6 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11> %i6 = fmul fast <6 x float> %br, %ar %i7 = fmul fast <6 x float> %bi, %ai %xr = fsub fast <6 x float> %i6, %i7 %i9 = fmul fast <6 x float> %bi, %ar %i10 = fmul fast <6 x float> %br, %ai %xi = fadd fast <6 x float> %i9, %i10 %zr = fsub fast <6 x float> %cr, %xi %zi = fadd fast <6 x float> %ci, %xr %interleaved.vec = shufflevector <6 x float> %zr, <6 x float> %zi, <12 x i32> <i32 0, i32 6, i32 1, i32 7, i32 2, i32 8, i32 3, i32 9, i32 4, i32 10, i32 5, i32 11> ret <12 x float> %interleaved.vec } | |
llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-uniform-cases.ll | ||
79 | Is this just not transforming because of commutativity on the add? Can we add a test for fadd fast <2 x float> %strided.vec20, %strided.vec too, to show it does transform. |
Came here to say that this looks nice and would be worthy a mention in the release notes, so that can be added, I think.
Then I started reading the patch again a bit and couldn't help myself writing down some nits.
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
---|---|---|
17 | Nit: perhaps just omit in parallel? Or if you want to keep it, it could benefit from a clarification. | |
26 | I wouldn't mind a few more words/sentences about the internal datastructures, the graph, nodes, etc. | |
51 | Nit: remove deinterleaving from this description? | |
60 | Nit: newline | |
110 | Can you say what these instructions could be? | |
174 | I haven't looked carefully, but why do we also need to record all instructions here? I mean, the graph can be queried for existence of nodes/instructions, or is this bookkeeping duplication more efficient? | |
182 | Nit: no validation going on here, only insertion? | |
182 | Nit: no validation going on here, only insertion? | |
240 | Nit: newline? | |
313 | Nit: can omit the comparison with 1. | |
344 | Nit: I am a fan of reducing indentation, e.g.: auto *SVI = dyn_cast<ShuffleVectorInst>(&I); if (!SVI) continue; if (!isInterleavingMask(SVI->getShuffleMask()) continue; etc. | |
368 | Nit: maybe I overlooked something, but this doesn't seem to use any state from the class so could be just a helper function? | |
468 | Nit: I am wondering if these casts here and below are necessary. | |
484 | nit: maybe some or all of these node setting things can be done in a constructor. | |
540 | Nits: |
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
---|---|---|
174 | The duplication makes it simpler to check the internal uses of a node (See ComplexDeinterleavingCompositeNode::hasAllInternalUses). Without it, we'd need to iterate over the nodes twice, and check the instructions of each node (All of Real, Imag, and the contents of InternalInstructions), while this approach allows us to simply say "Is this instruction known within the graph?". | |
182 | Good catch, the validation that was here was moved to be more tied into the identification. Renamed | |
368 | It populates the compositeNode structure through (formerly) validateAndSubmitCompositeNode, and also calls identifyNode further, which does use state from the class. | |
468 | You are correct, good catch. | |
484 | Not sure I agree with that, there are cases - such as a Shuffle - where none of the other settings are relevant. The operation type, real instruction, and imaginary instructions are the only settings relevant to all cases. | |
llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-mixed-cases.ll | ||
5 | Fixed this case (by checking on the backend whether the vector width is a power of 2), and added the test. | |
llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-uniform-cases.ll | ||
79 | Yep, it's the commutativity. I've added a note to this test stating this, and added another test with the fadd operands reversed. |
llvm/include/llvm/InitializePasses.h | ||
---|---|---|
101 | You can undo these | |
111 | There is a newline here? | |
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
413 | If you can turn some of these into better sentances that would be good. Perhaps "Mul instruction has multiple uses" like you have below. | |
482 | Formatting is a bit long. | |
llvm/lib/Target/ARM/ARMTargetMachine.cpp | ||
437 | This again. | |
llvm/test/CodeGen/ARM/O3-pipeline.ll | ||
2 | This looks like it was incorrectly added for this test. |
OK great. Thanks for the updates. I think this looks to be in a good state now. We can always do more nitpicking, and but it looks to be in a good state to get into tree. We can get this in now and add AArch64 support on top to increase the test coverage.
Some minor comments below but otherwise LGTM.
llvm/include/llvm/InitializePasses.h | ||
---|---|---|
108 | You can undo these. | |
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
591 | Formatting. | |
llvm/lib/Target/ARM/ARMISelLowering.cpp | ||
21860 | Formatting. | |
21868 | Add a message to the assert |
Agreed, and thanks for working on this.
Question about AArch64, is that something you are going to work on next?
Yep, over at D129066. It's currently slightly behind this, but is mostly there already.