Page MenuHomePhabricator

[ARM][CodeGen] Add support for complex deinterleaving
Needs ReviewPublic

Authored by NickGuy on Nov 18 2021, 10:29 AM.

Details

Summary

Adds the Complex Deinterleaving Pass implementing support for complex numbers in a target-independent manner, deferring to the TargetLowering for the given target to create a target-specific intrinsic.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
NickGuy updated this revision to Diff 416915.Mar 21 2022, 6:54 AM
NickGuy edited the summary of this revision. (Show Details)

It's been longer than I would've liked before I've gotten back to this, but I've;

  • Added support for matching partial complex multiplication patterns, where a complex number is multiplied by either the real or imaginary component.
  • Moved the lowering process from TTI to TLI.
  • Minimised the amount of data passed to the TLI, to only expose what is needed (and not implementation details).

I had also experimented with the idea of taking a case where complex numbers are multiplied by real numbers, but the
performance overhead of expanding the real number vector to match the expected pattern of our complex instructions (<a, _, b, _, c, _, ...>)
caused any gains to be marginal at best, and substantial regressions were seen almost across the board.

I don't intend on adding any new patterns to this patch, further patterns (and AArch64 support) will be implemented in follow-up patches.

Herald added a project: Restricted Project. · View Herald TranscriptMar 21 2022, 6:54 AM

OK - we can keep this going - we know it should give some decent performance. We can see where we go from there if something else materializes that can give the same gains.

At a high level - what this pass is doing either seems to be too complicated or too simple. What it effectively seems to do is pattern match shuffle(fadd/fmul/etc(shuffle, shuffle)) is a way that matches one of several patterns, and generates target intrinsics from that. If that is all that is needed then the "Graph" that gets created doesn't seem necessary. It only ever stores a single node, so isn't much of a graph. It is being more complicated than it needs to be.

But I imagine in practice that a lot of math routines will do more than a single complex multiply. There will be multiple parts, this should be building a graph that is made up of several nodes bit at a time and generating code from it only if it can create a full graph that looks profitable to generate.

There are a number of other comments below.

llvm/include/llvm/CodeGen/ComplexArithmetic.h
1 ↗(On Diff #416915)

ComplexArithmetic is a bit of a general name for what this pass is trying to achieve. What about something like ComplexDeinterleavingPass? It better explains that the pass is performing the deinterleaving from shuffles.

20 ↗(On Diff #416915)

Is this needed? Its strange to a macro like this and I'm not sure it's very useful for the places it is used.

59 ↗(On Diff #416915)

It doesn't seem that it would be necessary to combine ComplexArithmeticOperation operations together. It doesn't make much sense to have something that is both an Add _and_ a Partial Mul.

61 ↗(On Diff #416915)

Is this just needed to pass data between the pass and the TLI? It's probably simpler (and more like the existing methods) to just pass the parameters individually.

llvm/include/llvm/CodeGen/Passes.h
5

I think this line can be removed. The other comments here are /// doxygen comments.

llvm/include/llvm/CodeGen/TargetLowering.h
23

This isn't needed if the target adds the pass itself. There will probably need to be a way to specify _which_ patterns the target supports for a given type. For example MVE has both integer and floating point complex operations. If the subtarget has only MVE (not MVE.fp), then it needs to support the integer complex operations, without supporting floating point. Other differences could exist like one architecture supporting a different subset of operations.

29

Is GeneratedIntrinsicCount just for statistics? If so it doesn't seem to be worth complicating the interface for.

llvm/lib/CodeGen/ComplexArithmetic.cpp
40 ↗(On Diff #416915)

This leaks memory. Same for the ones below. They can probably return a SmallVector<int>.

107 ↗(On Diff #416915)

Why are Real and Imaginary needed?

110 ↗(On Diff #416915)

Loads and Stores don't appear in the graph.

112 ↗(On Diff #416915)

Why is AddOperand needed? What does it mean?

113–114 ↗(On Diff #416915)

How is Preserve different to Input?

116 ↗(On Diff #416915)

What does this refer to?

119–120 ↗(On Diff #416915)

What does this mean and refer to?

321 ↗(On Diff #416915)

This looks like debugging code that should be removed. There is already an option to disable it.

660–661 ↗(On Diff #416915)

Remove commented out code.

703 ↗(On Diff #416915)

This shouldn't be something that is hard-coded in the pass. It might make more sense to push the splitting up of larger types to the target. There are cases like fixed-length SVE where the width won't be limited to 128bit vectors, and pushing that out to the target will be more general.

804 ↗(On Diff #416915)

This logic is all a bit strange. Why have Changed?

for (auto &I : *B) {
  if (auto *SVI = dyn_cast<ShuffleVectorInst>(&I)) {
    if (isInterleaving(SVI))
      if (traverseAndPopulateGraph(TLI, SVI, Graph))
        Substituted |= substituteGraph(TLI, &I, Graph, DeadInsts);
820 ↗(On Diff #416915)

I think this is just RecursivelyDeleteTriviallyDeadInstructions

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
15 ↗(On Diff #416915)

These are probably unneeded?

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-add.ll
3

I would use run lines that that are similar to the llvm/test/CodeGen/Thumb2/mve-xyz.ll tests. Something like:

; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs %s -o - | FileCheck %s

It's best not to use a specific cpu, using the arch instead.

The tests can probably go in the same place too, under llvm/test/CodeGen/Thumb2/mve-xyz.ll for mve tests.

6

This probably isn't worth testing if it is testing codegen already.

34

A lot of these tests are strange - they seem to have infinite loops?

I think you should be able to remove most of it to make a much cleaner test. It doesn't even need the loads/stores:

define <8 x half> @complex_add_v8f16(<8 x half> %a, <8 x half> %b) {
entry:
  %strided.vec = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %strided.vec21 = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %strided.vec23 = shufflevector <8 x half> %b, <8 x half> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %strided.vec24 = shufflevector <8 x half> %b, <8 x half> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %0 = fsub fast <4 x half> %strided.vec23, %strided.vec21
  %1 = fadd fast <4 x half> %strided.vec24, %strided.vec
  %interleaved.vec = shufflevector <4 x half> %0, <4 x half> %1, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
  ret <8 x half> %interleaved.vec
}
NickGuy updated this revision to Diff 421520.Apr 8 2022, 7:24 AM
NickGuy marked 22 inline comments as done.
NickGuy edited the summary of this revision. (Show Details)
NickGuy added inline comments.
llvm/include/llvm/CodeGen/ComplexArithmetic.h
20 ↗(On Diff #416915)

It was more useful (though not necessarily needed) in an earlier iteration. I've removed it now.

59 ↗(On Diff #416915)

As above, useful in an earlier iteration. Also removed

llvm/lib/CodeGen/ComplexArithmetic.cpp
40 ↗(On Diff #416915)

Good catch, fixed.

119–120 ↗(On Diff #416915)

This entire enum is no longer required, so I've removed it all.

703 ↗(On Diff #416915)

I've pushed the 128 part to be provided by the target, but couldn't figure out a nice way to do the splitting entirely on the target. I'll be revisiting this when I have concrete examples of how scalable vectors will work.

804 ↗(On Diff #416915)

Holdover from an early iteration, where I tried to build the whole graph before changing anything. Simplified

820 ↗(On Diff #416915)

Looks to do the job, thanks

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-add.ll
34

These were initially generated from a larger IR, and pushed through llvm-reduce. I've rewritten them manually to be much smaller (basing them off of your snippet).

mgabka added a subscriber: mgabka.Apr 14 2022, 3:25 AM
huntergr added inline comments.May 30 2022, 2:03 AM
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
483

I can't find any place where rotations are dealt with besides assignment and equality checks of 0,90,180,270 -- might an enum be preferable? Or do you anticipate doing arithmetic with them in a follow-up patch?

691

I think it would be worthwhile to add a comment about what the interleaving represents -- that you're looking for a shuffle that takes separate vectors of real and imaginary parts and combines them before they are stored to memory (or returned in registers), and that this is just for matching per-lane operations instead of cross-lane (like a reduction).

Or at least that's the behaviour I've observed when testing your patch with a loop like the following:

#define LEN (512)
float _Complex a[ LEN ];
float _Complex b[ LEN ];
float _Complex c[ LEN ];

void foo (void) {
#pragma clang loop vectorize(enable)
  for (int i = 0; i < LEN; ++i)
    a[i] = b[i] * c[i];
}
NickGuy updated this revision to Diff 438664.Jun 21 2022, 5:55 AM
NickGuy marked 2 inline comments as done.

Since the last patch, I've redesigned the high-level approach; It now crawls through the instruction graph to find opportunities for complex deinterleaving before attempting to perform said deinterleaving. Doing it this way allows us to short-circuit, bailing out and preventing the heavy work from being performed if something isn't supported.
This iteration also implements support for complex operations with multiple steps/operands (e.g. a[i] * b[i] * c[i])

I haven't really looked to deeply into the meat of the pass yet - how it does the matching - but I had a chunk of comments for the rest.

What happened to the ideas of starting from a shuffle and working through a worklist of pairs of [Real, Imaginary] Values that we match to complex nodes? It would build up a graph of complex nodes to replace the original shuffle, providing that the leaves were all valid de-interleavings. At least on paper I feel we should just be able to perform the search without looking though a lot of uses (except for checking there are no extra uses, of course), and all the Unmatched instructions and what looks like matching of random unrelated snippets would be cleaned up.

llvm/lib/CodeGen/CMakeLists.txt
49

Formatting is off here

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
1–2

This looks like it got formatted incorrectly.

99

item -> Item

100

i -> I (or a better name, if I is already used)

149

Reciprocal Throughput is more common.

511

evaluateComplexDeinterleavingBasicBlock -> evaluateBasicBlock

991–992

It is better to structure things in a way where we decide whether to do something, then do it. Not get half way through doing it and decide we didn't want to in the end. In what ways would we expect createComplexDeinterleavingIR to return nullptr at the moment?

1017

Is this cost necessary at the moment, or will it always be profitable for the simple cases?

llvm/lib/Target/ARM/ARMISelLowering.cpp
21738–21742

Things like this can be a single if:

if (!VTy || VTy->getNumElements() * VTy->getScalarSizeInBits() != 128)
    return false;

I presume the == 128 can be removed if we teach it how to split the vectors up?

21773

Can this be an assert instead.

llvm/lib/Target/ARM/ARMTargetMachine.cpp
437

This doesn't need to add the new brackets.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-mul.ll
32

Add arm_aapcs_vfpcc to any tests that take or return vectors.
I think you can remove #0 too.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f64-mul.ll
2

It might be a little cleaner to add +fp64 for all these f64 tests.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-rotations-add.ll
1 ↗(On Diff #421520)

MVE tests can go into llvm/test/CodeGen/Thumb2/mve-complex-xyz.ll. So long as they are all updated by the test script, that should be fine to keep them with the rest.

NickGuy added a comment.EditedJun 22 2022, 4:02 AM

What happened to the ideas of starting from a shuffle and working through a worklist of pairs of [Real, Imaginary] Values that we match to complex nodes?

While that would work "simply" enough for cases like a * b, more elaborate cases (e.g. (a * b) * c) would result in some ambiguity as to which add(mul, mul) pairs with which sub(mul, mul). This complexity inflates further the more operands are involved. The approach I've implemented here considers each complex node in isolation (with the exception of the accumulator value).

I presume the == 128 can be removed if we teach it how to split the vectors up?I presume the == 128 can be removed if we teach it how to split the vectors up?

Yep, vector splitting is something I decided to push out of this initial patch, and will be implemented in a follow-up. (Due to each node being treated in isolation, the vector splitting from previous iterations got overly complicated and unwieldy). The ideal solution that I can see would be to teach the intrinsic how to split, rather than the pass (somewhere like DAGTypeLegalizer::SplitVectorResult).

Is this cost necessary at the moment, or will it always be profitable for the simple cases?

Complex add has a few cases where I've observed performance regressions, and that's the sort of thing this rudimentary cost-modelling is intended to catch.

Remaining comments will be addressed in a follow-up patch.

chill added a subscriber: chill.Jun 28 2022, 3:28 AM
chill added inline comments.
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
136

Here and elsewhere std::find -> llvm::find

593

Why are we scanning the whole block?
It looks to me (admittedly I don't understand what this pass does yet)
we can just walk over the Instructions vector (in reverse, if the order matters),
avoiding the quadratic complexity.

633

Here and in a few other places for (Value *V : Op0->operands()) {

chill added a comment.Jun 28 2022, 3:30 AM

Do we have some broad overview of the approach and the algorithm? It'd be a good idea put something like this in the description and eventually the commit message. (I searched for LLVM complex RFCs, buy couldn't find anything useful).

chill added inline comments.Jun 28 2022, 7:42 AM
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
325

return std::make_shared<ComplexDeinterleavingCompositeNode>(Operation);

561

These could be, e.g.:

static const int RealMask[] = {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30};
auto RealMaskRef = ArrayRef<int>(RealMask, ShufleMask.size());

with an assertion/bounds check. Good enough for 512-bit vectors with 16-bit elements, can be extended.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99

Shouldn't these be translated to a couple of vcadd.f32 instructions, like in the previous test?
And this amount of move instructions seems excessive.

chill added inline comments.Jun 28 2022, 7:46 AM
llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99

And this amount of move instructions seems excessive.

I guess MVE does not have sensible swizzling instructions.

What happened to the ideas of starting from a shuffle and working through a worklist of pairs of [Real, Imaginary] Values that we match to complex nodes?

While that would work "simply" enough for cases like a * b, more elaborate cases (e.g. (a * b) * c) would result in some ambiguity as to which add(mul, mul) pairs with which sub(mul, mul). This complexity inflates further the more operands are involved. The approach I've implemented here considers each complex node in isolation (with the exception of the accumulator value).

Hmm, OK. I was thinking about a three way multiply when looking at that way of structuring the pass, but only one pattern and only ever on paper. I was worried that there were cases where it was ambiguous, but figured if it was it could always just try both possibilities. But I've not implemented it, it just sounded like an elegant way of treating this like slightly more complex pattern matcher, as opposed to all this findUnmatchedInstructions and looking through uses.

A broad overview of the algorithm would be useful, like Momchil mentioned, perhaps in the file description. I was trying to make a sketch before but got a bit lost in nested loops.

I presume the == 128 can be removed if we teach it how to split the vectors up?I presume the == 128 can be removed if we teach it how to split the vectors up?

Yep, vector splitting is something I decided to push out of this initial patch, and will be implemented in a follow-up. (Due to each node being treated in isolation, the vector splitting from previous iterations got overly complicated and unwieldy). The ideal solution that I can see would be to teach the intrinsic how to split, rather than the pass (somewhere like DAGTypeLegalizer::SplitVectorResult).

A separate addition sounds good. Lets try and get something that we happy with and extend it out.

Is this cost necessary at the moment, or will it always be profitable for the simple cases?

Complex add has a few cases where I've observed performance regressions, and that's the sort of thing this rudimentary cost-modelling is intended to catch.

Do you have a test-case where this happens? Cost modelling will probably be good to add in the long run, I'd just like to understand where it currently can make things worse.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99

Yeah this is expected from shuffles that MVE cannot handle very well. It would look a lot better either with loads that could become interleaving loads, or under AArch64 where better shuffles are available.

NickGuy updated this revision to Diff 442045.Jul 4 2022, 2:55 AM
NickGuy marked 6 inline comments as done.
NickGuy retitled this revision from [ARM][CodeGen] Add support for complex addition and multiplication to [ARM][CodeGen] Add support for complex deinterleaving.

Addressed comments, and added a file description that attempts to explain what the pass does.

NickGuy marked 15 inline comments as done.Jul 4 2022, 3:42 AM
NickGuy added inline comments.
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
593

The order of the CompositeNodes is important, this was a holdover from before we were sorting the nodes.

llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll
99

Shouldn't these be translated to a couple of vcadd.f32 instructions, like in the previous test?

Not in this case, no. The IR vectors are too wide to fit into actual vector registers, and the follow-up vector splitting patch (D129067) is restricted to splitting in half.

Some of the comments looked like they didn't get resolved, and the latest version is missing context.

It would be good to get to the point where the tests are in a good place and we can commit those, allowing us to just show just the differences here. They need to be in the right place first though.

NickGuy updated this revision to Diff 449946.Thu, Aug 4, 6:00 AM
NickGuy added a comment.EditedThu, Aug 4, 6:01 AM

I've pre-committed the tests, and redesigned the core algorthm to operate on pairs of operands, rather than analysing the uses of a given instruction. The vector splitting has also been implemented on the target side

The tests are good, but they currently only cover each of the types for vcadd and vcmla0/90. I think we need more to show all the edgecases that should or should not be matched. They are not needed for all types, but we should try and make examples of each.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
163

ConvergingI is the Root of the graph? Perhaps just call it RootValue.

529

What if the operands are FMul(A, A) and FMul (B, C)?
I think it needs to be more precise about which nodes are which.

564

I don't think it makes sense to treat these backwards. We know the CR part should be real and the CI is imaginary. It would only be valid to treat them the other way around if there was some sort of shuffle added.

593

Does this need to be calling identifyNode on the third argument too? The one derived from UncommonValues.

656

This looks like it needs to match:

  • RealI and ImagI are both shuffles.
  • They both have the same Operand0
  • They both have "deinterleaving" masks.

I don't think the type of the value of Operand0 otherwise matters. It doesn't matter if it is a Load or a Argument, we can always just use it.

749

It's best not to use static data like this, we can make it more generic. The match can be m_Shuffle(..., m_Mask(Mask)), then check that the Mask is an isDeinterleavingMask. It does need to check _which_ deinterleaving mask it is though for the Real/Imaginary parts.

854

If the Operands that CN depends on are included in the Node, then we can just walk up the tree making sure we create the Operands before the Nodes that use them, using the Value's that the operands produce as the Input0/Input1/Accumulator below.
That avoids the need to "wrangle" any inputs at this stage, as we already know the nodes we need.

869

It might be simpler to remove anything Cost based from this revision, adding it back in if it is needed in followup patches.
All the identification that happens in this patch should always be cheaper, as far as I understand.