This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Lower fused complex multiply-add intrinsic to AArch64::FCMA
Needs Review · Public

Authored by nujaa on Apr 11 2023, 9:42 PM.

Details

Summary

Inspired by the fmuladd intrinsic, I added lowering of a fcmuladd intrinsic to a combination of AArch64 FCMLA instructions. This aims to enable vectorised complex operations.

Diff Detail

Event Timeline

nujaa created this revision. Apr 11 2023, 9:42 PM
Herald added a reviewer: ftynse. · View Herald Transcript
Herald added a reviewer: dcaballe. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
nujaa requested review of this revision. Apr 11 2023, 9:42 PM
nujaa updated this revision to Diff 512654. Apr 11 2023, 10:04 PM
nujaa removed reviewers: dcaballe, nicolasvasilache.

Removed MLIR-related code.

nujaa added a comment (edited). Apr 11 2023, 11:57 PM

git-clang-format is unhappy because the formatting it generates for llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp looks like this:

case ISD::FMA:                        return "fma";
case ISD::FCMA:
  return "fcma";
case ISD::STRICT_FMA:                return "strict_fma";
case ISD::FMAD:                      return "fmad";

This does not follow the current formatting of the file. What do you suggest?

git-clang-format is unhappy because the formatting it generates for llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp looks like this:

case ISD::FMA:                        return "fma";
case ISD::FCMA:
  return "fcma";
case ISD::STRICT_FMA:                return "strict_fma";
case ISD::FMAD:                      return "fmad";

This does not follow the current formatting of the file. What do you suggest?

Keep the existing formatting.

Need to update docs/LangRef.rst

Could you clarify whether there will be additional work in the future? There is already a pass at llvm/lib/CodeGen/ComplexDeinterleaving.cpp that generates the FCMLA/FCADD architecture-specific intrinsics using TargetLowering::createComplexDeinterleavingIR.

I'm not sure I agree with having a high-level complex intrinsic (though if done right, I'm not completely against the idea); it locks the IR into the concept of a complex multiply rather than the individual instructions that make it up. This could result in lower net performance, as other optimisation passes cannot see how the intrinsic works internally and so cannot apply their optimisations.
I don't think a common intrinsic that maps to only one or two backends is worth supporting, especially as, as mentioned by Igor, llvm/lib/CodeGen/ComplexDeinterleaving.cpp already handles this on a per-target basis.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
5778–5789

I may be missing it, but where do you handle targets that don't support complex instructions? I appreciate that nothing should be generating FCMAs yet, but if something matches a complex multiply-accumulate and emits one on a target that doesn't support it, the fcma should ideally be expanded into the equivalent fp operations (shuffle(add(mul, mul), sub(mul, mul))).

Also, this should probably be moved to its own function, if for nothing else than to match the style of the rest of the switch.
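As a reference for that fallback, the expansion of an fcma over interleaved (re, im) lanes could be sketched as follows. Python is used purely for illustration; the function name and lane layout are assumptions for the sketch, not part of the patch:

```python
# Hypothetical scalar reference for expanding fcma(a, b, acc) = a * b + acc
# on targets without FCMLA. Vectors are laid out [re0, im0, re1, im1, ...],
# matching the interleaved representation used elsewhere in this review.
def expand_fcma(a, b, acc):
    out = []
    for i in range(0, len(a), 2):
        ar, ai = a[i], a[i + 1]
        br, bi = b[i], b[i + 1]
        # Real lane: fsub of partial products; imag lane: fadd of the others,
        # i.e. the shuffle(add(mul, mul), sub(mul, mul)) shape mentioned above.
        out.append(acc[i] + ar * br - ai * bi)
        out.append(acc[i + 1] + ar * bi + ai * br)
    return out

# (1 + 2i) * (3 + 4i) + 0 = -5 + 10i
# expand_fcma([1, 2], [3, 4], [0, 0]) -> [-5, 10]
```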

mgabka added a subscriber: mgabka. Apr 12 2023, 4:41 AM
nujaa added a comment. Apr 12 2023, 6:26 PM

Could you clarify whether there will be additional work in the future? There is already a pass at llvm/lib/CodeGen/ComplexDeinterleaving.cpp that generates the FCMLA/FCADD architecture-specific intrinsics using TargetLowering::createComplexDeinterleavingIR.

Hi,
I did not know about this pass; I will look into it and check whether it fits our needs. Thank you.
As additional work within LLVM, we have added complex multiplication without an accumulator, plus conjugate recognition to fuse into the FCMLA.
For a bit of context, we generate this complex code from MLIR, where we handle vectors of complex numbers. It currently works but is not ready for upstreaming.

I'm not sure I agree with having a high-level complex intrinsic (though if done right, I'm not completely against the idea); it locks the IR into the concept of a complex multiply rather than the individual instructions that make it up. This could result in lower net performance, as other optimisation passes cannot see how the intrinsic works internally and so cannot apply their optimisations.

Regarding performance: in our use case of BLAS libraries, we manage to reach better performance than hand-optimised assembly on caxpy, cgemv and cgemm.

But where do you handle targets that don't support complex instructions? I appreciate that nothing should be generating FCMAs yet.

Indeed, I have not added support for other architectures, or for architectures that do not support complex multiply-accumulate, for this exact reason. For now, no pattern matching generates this intrinsic. That will be required before pushing the MLIR side that generates them.

For a bit of context, we generate this complex code from MLIR, where we handle vectors of complex numbers.
Regarding performance: in our use case of BLAS libraries, we manage to reach better performance than hand-optimised assembly on caxpy, cgemv and cgemm.

I'd be interested to see how the performance of this differs from what the ComplexDeinterleavingPass emits, or if the patterns aren't recognised by the pass, why that might be.

nujaa added a comment (edited). Apr 16 2023, 9:03 PM

It locks the IR into the concept of a complex multiply, rather than the individual instructions that make it up

Do you mean you would rather see, say, an fcmul intrinsic representing a vector complex multiplication, which would eventually be fused with an fadd?

I'd be interested to see how the performance of this differs from what the ComplexDeinterleavingPass emits, or if the patterns aren't recognised by the pass, why that might be.

Hi, I realised your patch had not yet been upstreamed when I created these changes, which explains the divergence. Also, the patterns would not be recognised anyway: MLIR does not support vectors of complex, and our improvised lowering, which works for our use case, does not generate them as shuffles + computation ops (which might be preferred).
So the next steps are: I'll try generating complex operations as shuffle + computation ops, as your implementation expects, and let you know about the performance.
To validate your implementation for my use case, I will also need to implement conjugate fusing and commutativity (as rotation only affects one operand, and I need to be able to conjugate both operands to reach my target performance). Eventually we could have something like:

define <4 x float> @complex_mul_v4f32(<4 x float> %a, <4 x float> %b) {
; CHECK-LABEL: complex_mul_v4f32:
; CHECK:       // %bb.0: // %entry
; CHECK-NEXT:    movi v2.2d, #0000000000000000
; CHECK-NEXT:    fcmla v2.4s, v0.4s, v1.4s, #0
; CHECK-NEXT:    fcmla v2.4s, v0.4s, v1.4s, #270
; CHECK-NEXT:    mov v0.16b, v2.16b
; CHECK-NEXT:    ret
entry:
  %a.real = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %a.imag = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %b.real = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2>
  %b.imag = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3>
  %a.conj = fneg <2 x float> %a.imag
  %0 = fmul fast <2 x float> %b.imag, %a.real
  %1 = fmul fast <2 x float> %b.real, %a.conj
  %2 = fadd fast <2 x float> %1, %0
  %3 = fmul fast <2 x float> %b.real, %a.real
  %4 = fmul fast <2 x float> %a.conj, %b.imag
  %5 = fsub fast <2 x float> %3, %4
  %interleaved.vec = shufflevector <2 x float> %5, <2 x float> %2, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
  ret <4 x float> %interleaved.vec
}
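As a cross-check, the deinterleaved arithmetic in the IR above (split into even/odd lanes, fneg on a's imaginary lanes, then recombine) computes conj(a) * b pair-wise. A hypothetical Python reference, mirroring each IR value (helper name is illustrative only):

```python
def complex_mul_conj(a, b):
    # a, b laid out [re0, im0, re1, im1, ...]; returns conj(a) * b per pair.
    a_real, a_imag = a[0::2], a[1::2]      # the two shufflevectors on %a
    b_real, b_imag = b[0::2], b[1::2]      # the two shufflevectors on %b
    a_conj = [-x for x in a_imag]          # %a.conj = fneg %a.imag
    out = []
    for ar, ac, br, bi in zip(a_real, a_conj, b_real, b_imag):
        imag = br * ac + bi * ar           # %2 = fadd (%1 = br*ac), (%0 = bi*ar)
        real = br * ar - ac * bi           # %5 = fsub (%3 = br*ar), (%4 = ac*bi)
        out += [real, imag]                # %interleaved.vec
    return out

# conj(1 + 2i) * (3 + 4i) = 11 - 2i
# complex_mul_conj([1, 2], [3, 4]) -> [11, -2]
```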

One thing I am afraid of is the explosion of use cases. Please tell me if I am missing something or if I'm wrong, but at the LLVM level there is a use case:

  • for each combination of sub/add (4 cases) (example: mul_mul_with_fneg in complex-deinterleaving-uniform-cases.ll)
  • for negated operands, similar to the previous example but represented differently (2 cases), e.g. neg(a) x b => fcmla a, b, #0; fcmla a, b, #270 [+ potentially neg(a) x neg(b) => a x b]
  • for conjugated operands (2 cases) (example above) [+ potentially conj(a) x conj(b) => conj(a x b)]

This leads us to 16 cases, multiplied by 2 if we take care of commutativity. Maybe generating an intermediate target-specific complex-multiplication ISD node would help hide the combinations of subs/adds.
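For what it's worth, the rotation bookkeeping behind these cases can be modelled per element pair. The sketch below assumes the FCMLA rotation semantics as I read them from the Arm architecture reference (operand roles and the helper name are illustrative only):

```python
def fcmla(acc, a, b, rot):
    # One FCMLA element pair, modelled on Python complex numbers.
    if rot == 0:
        return complex(acc.real + a.real * b.real, acc.imag + a.real * b.imag)
    if rot == 90:
        return complex(acc.real - a.imag * b.imag, acc.imag + a.imag * b.real)
    if rot == 180:
        return complex(acc.real - a.real * b.real, acc.imag - a.real * b.imag)
    if rot == 270:
        return complex(acc.real + a.imag * b.imag, acc.imag - a.imag * b.real)
    raise ValueError("rot must be 0, 90, 180 or 270")

a, b = 1 + 2j, 3 + 4j
# #0 then #90 composes a full complex multiply-accumulate: a * b
assert fcmla(fcmla(0j, a, b, 0), a, b, 90) == a * b
# #0 then #270 composes conj(a) * b, as in the IR example above
assert fcmla(fcmla(0j, a, b, 0), a, b, 270) == a.conjugate() * b
```

Enumerating which rotation pairs compose which sign/conjugate combination is one way to sanity-check how many of the 16 cases are genuinely distinct.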

At the asm level, recognising complex multiplications and fusing other operations becomes quite cumbersome because of the combinations of rotations and the need to recognise common operands between FCMLAs, so we might want to avoid pattern matching there.
What do you think?

Out of curiosity, where do you generate your shuffles from?