This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Lower fused complex multiply-add intrinsic to AArch64::FCMA
Needs Review · Public

Authored by nujaa on Apr 11 2023, 9:42 PM.

Details

Summary

Inspired by the fmuladd intrinsic, I added lowering of a fcmuladd intrinsic to a combination of AArch64 FCMLA instructions. This aims to enable vectorised complex operations.

Diff Detail

Event Timeline

nujaa created this revision. Apr 11 2023, 9:42 PM
Herald added a reviewer: ftynse. · View Herald Transcript
Herald added a reviewer: dcaballe. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
nujaa requested review of this revision. Apr 11 2023, 9:42 PM
nujaa updated this revision to Diff 512654. Apr 11 2023, 10:04 PM
nujaa removed reviewers: dcaballe, nicolasvasilache.

Removed MLIR-related code.

nujaa added a comment (edited). Apr 11 2023, 11:57 PM

git-clang-format is unhappy because the formatting it generates for llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp looks like this:

case ISD::FMA:                        return "fma";
case ISD::FCMA:
  return "fcma";
case ISD::STRICT_FMA:                return "strict_fma";
case ISD::FMAD:                      return "fmad";

This does not follow the current formatting of the file. What do you suggest?

git-clang-format is unhappy because the formatting it generates for llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp looks like this:

case ISD::FMA:                        return "fma";
case ISD::FCMA:
  return "fcma";
case ISD::STRICT_FMA:                return "strict_fma";
case ISD::FMAD:                      return "fmad";

This does not follow the current formatting of the file. What do you suggest?

Keep the existing formatting.

Need to update docs/LangRef.rst

Could you clarify whether there will be additional work in the future? There is already a pass at llvm/lib/CodeGen/ComplexDeinterleaving.cpp that generates the FCMLA/FCADD architecture-specific intrinsics using TargetLowering::createComplexDeinterleavingIR.

I'm not sure I agree with having a high-level complex intrinsic (though if done right, I'm not completely against the idea); it locks the IR into the concept of a complex multiply rather than the individual instructions that make it up. This could result in lower net performance, as other optimisation passes cannot see how the intrinsic works internally and so cannot apply their optimisations.
I don't think a common intrinsic that maps to only one or two backends is worth supporting, especially as, as mentioned by Igor, llvm/lib/CodeGen/ComplexDeinterleaving.cpp already handles this on a per-target basis.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
5778–5789

I may be missing it, but where do you handle targets that don't support complex instructions? I appreciate that nothing should be generating FCMAs yet, but if something matches a complex multiply-accumulate and emits one on a target that doesn't support it, the fcma should ideally be expanded into the equivalent fp operations (shuffle(add(mul, mul), sub(mul, mul))).

Also, this should probably be moved to its own function, if for nothing else than to match the style of the rest of the switch.
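As a reference for that fallback, the expansion of an fcma over interleaved (re, im) lanes could be sketched as follows. Python is used purely for illustration; the function name and lane layout are assumptions for the sketch, not part of the patch:

```python
# Hypothetical scalar reference for expanding fcma(a, b, acc) = a * b + acc
# on targets without FCMLA. Vectors are laid out [re0, im0, re1, im1, ...],
# matching the interleaved representation used elsewhere in this review.
def expand_fcma(a, b, acc):
    out = []
    for i in range(0, len(a), 2):
        ar, ai = a[i], a[i + 1]
        br, bi = b[i], b[i + 1]
        # Real lane: fsub of partial products; imag lane: fadd of the others,
        # i.e. the shuffle(add(mul, mul), sub(mul, mul)) shape mentioned above.
        out.append(acc[i] + ar * br - ai * bi)
        out.append(acc[i + 1] + ar * bi + ai * br)
    return out

# (1 + 2i) * (3 + 4i) + 0 = -5 + 10i
# expand_fcma([1, 2], [3, 4], [0, 0]) -> [-5, 10]
```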

mgabka added a subscriber: mgabka. Apr 12 2023, 4:41 AM
nujaa added a comment. Apr 12 2023, 6:26 PM

Could you clarify whether there will be additional work in the future? There is already a pass at llvm/lib/CodeGen/ComplexDeinterleaving.cpp that generates the FCMLA/FCADD architecture-specific intrinsics using TargetLowering::createComplexDeinterleavingIR.

Hi,
I did not know about this pass; I will look into it and check whether it fits our needs. Thank you.
As additional work within LLVM, we have added complex multiplication without an accumulator, plus conjugate recognition to fuse into the FCMLA.
For a bit of context, we generate this complex code from MLIR, where we handle vectors of complex numbers. It currently works but is not ready for upstreaming.

I'm not sure I agree with having a high-level complex intrinsic (though if done right, I'm not completely against the idea); it locks the IR into the concept of a complex multiply rather than the individual instructions that make it up. This could result in lower net performance, as other optimisation passes cannot see how the intrinsic works internally and so cannot apply their optimisations.

Regarding performance: in our use case of BLAS libraries, we manage to reach better performance than hand-optimised assembly on caxpy, cgemv and cgemm.

But where do you handle targets that don't support complex instructions? I appreciate that nothing should be generating FCMAs yet.

Indeed, I have not added support for other architectures, or for architectures that do not support complex multiply-accumulate, for this exact reason. For now, no pattern matching generates this intrinsic. That will be required before pushing the MLIR side that generates them.

For a bit of context, we generate this complex code from MLIR, where we handle vectors of complex numbers.
Regarding performance: in our use case of BLAS libraries, we manage to reach better performance than hand-optimised assembly on caxpy, cgemv and cgemm.

I'd be interested to see how the performance of this differs from what the ComplexDeinterleavingPass emits, or if the patterns aren't recognised by the pass, why that might be.

nujaa added a comment (edited). Apr 16 2023, 9:03 PM

It locks the IR into the concept of a complex multiply, rather than the individual instructions that make it up

Do you mean you would rather see, say, an fcmul intrinsic representing a vector complex multiplication, which would eventually be fused with an fadd?

I'd be interested to see how the performance of this differs from what the ComplexDeinterleavingPass emits, or if the patterns aren't recognised by the pass, why that might be.

Hi, I realised your patch had not yet been upstreamed when I created these changes, which explains the divergence. Also, the patterns would not be recognised anyway: MLIR does not support vectors of complex, and our improvised lowering, which works for our use case, does not generate them as shuffles + computation ops (which might be preferred).
So the next steps are: I'll try generating complex operations as shuffle + computation ops, as your implementation expects, and let you know about the performance.
To validate your implementation for my use case, I will also need to implement conjugate fusing and commutativity (as rotation only affects one operand, and I need to be able to conjugate both operands to reach my target performance). Eventually we could have something like:

define <4 x float> @complex_mul_v4f32(<4 x float> %a, <4 x float> %b) {
; CHECK-LABEL: complex_mul_v4f32:
; CHECK:       // %bb.0: // %entry
; CHECK-NEXT:    movi v2.2d, #0000000000000000
; CHECK-NEXT:    fcmla v2.4s, v0.4s, v1.4s, #0
; CHECK-NEXT:    fcmla v2.4s, v0.4s, v1.4s, #270
; CHECK-NEXT:    mov v0.16b, v2.16b
; CHECK-NEXT:    ret
entry:
  %a.real = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %a.imag = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %b.real = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %b.imag = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %a.conj = fneg <2 x float> %a.imag
  %0 = fmul fast <2 x float> %b.imag, %a.real
  %1 = fmul fast <2 x float> %b.real, %a.conj
  %2 = fadd fast <2 x float> %1, %0
  %3 = fmul fast <2 x float> %b.real, %a.real
  %4 = fmul fast <2 x float> %a.conj, %b.imag
  %5 = fsub fast <2 x float> %3, %4
  %interleaved.vec = shufflevector <2 x float> %5, <2 x float> %2, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
  ret <4 x float> %interleaved.vec
}
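As a cross-check, the deinterleaved arithmetic in the IR above (split into even/odd lanes, fneg on a's imaginary lanes, then recombine) computes conj(a) * b pair-wise. A hypothetical Python reference, mirroring each IR value (helper name is illustrative only):

```python
def complex_mul_conj(a, b):
    # a, b laid out [re0, im0, re1, im1, ...]; returns conj(a) * b per pair.
    a_real, a_imag = a[0::2], a[1::2]      # the two shufflevectors on %a
    b_real, b_imag = b[0::2], b[1::2]      # the two shufflevectors on %b
    a_conj = [-x for x in a_imag]          # %a.conj = fneg %a.imag
    out = []
    for ar, ac, br, bi in zip(a_real, a_conj, b_real, b_imag):
        imag = br * ac + bi * ar           # %2 = fadd (%1 = br*ac), (%0 = bi*ar)
        real = br * ar - ac * bi           # %5 = fsub (%3 = br*ar), (%4 = ac*bi)
        out += [real, imag]                # %interleaved.vec
    return out

# conj(1 + 2i) * (3 + 4i) = 11 - 2i
# complex_mul_conj([1, 2], [3, 4]) -> [11, -2]
```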

One thing I am afraid of is the explosion of use cases. Please tell me if I am missing something or if I'm wrong, but at the LLVM level there is a use case:

  • for each combination of sub/add (4 cases) (example: mul_mul_with_fneg in complex-deinterleaving-uniform-cases.ll)
  • for negated operands, similar to the previous example but represented differently (2 cases), e.g. neg(a) x b => fcmla a, b, #0; fcmla a, b, #270 [+ potentially neg(a) x neg(b) => a x b]
  • for conjugated operands (2 cases) (example above) [+ potentially conj(a) x conj(b) => conj(a x b)]

This leads us to 16 cases, multiplied by 2 if we take care of commutativity. Maybe generating an intermediate target-specific complex-multiplication ISD node would help hide the combinations of subs/adds.
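For what it's worth, the rotation bookkeeping behind these cases can be modelled per element pair. The sketch below assumes the FCMLA rotation semantics as I read them from the Arm architecture reference (operand roles and the helper name are illustrative only):

```python
def fcmla(acc, a, b, rot):
    # One FCMLA element pair, modelled on Python complex numbers.
    if rot == 0:
        return complex(acc.real + a.real * b.real, acc.imag + a.real * b.imag)
    if rot == 90:
        return complex(acc.real - a.imag * b.imag, acc.imag + a.imag * b.real)
    if rot == 180:
        return complex(acc.real - a.real * b.real, acc.imag - a.real * b.imag)
    if rot == 270:
        return complex(acc.real + a.imag * b.imag, acc.imag - a.imag * b.real)
    raise ValueError("rot must be 0, 90, 180 or 270")

a, b = 1 + 2j, 3 + 4j
# #0 then #90 composes a full complex multiply-accumulate: a * b
assert fcmla(fcmla(0j, a, b, 0), a, b, 90) == a * b
# #0 then #270 composes conj(a) * b, as in the IR example above
assert fcmla(fcmla(0j, a, b, 0), a, b, 270) == a.conjugate() * b
```

Enumerating which rotation pairs compose which sign/conjugate combination is one way to sanity-check how many of the 16 cases are genuinely distinct.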

At the asm level, recognising complex multiplications and fusing other operations becomes quite cumbersome because of the combinations of rotations and the need to recognise common operands between FCMLAs, so we might want to avoid pattern matching there.
What do you think?

Out of curiosity, where do you generate your shuffles from?