This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Change aarch64_neon_pmull{,64} intrinsic ISel through a new SDNode.
ClosedPublic

Authored by mingmingl on Aug 2 2022, 11:25 PM.

Details

Summary

How:

  1. Add an AArch64ISD::PMULL SDNode, and extend the aarch64_neon_pmull intrinsic tablegen pattern to this SDNode.
  2. For aarch64_neon_pmull64, canonicalize i64 operands to v1i64 vectors during legalization.
  3. For {aarch64_neon_pmull, aarch64_neon_pmull64}, combine the intrinsic into the SDNode in the DAG combiner (a minimal sketch follows this list).
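A minimal sketch of steps 2 and 3, assuming the usual SelectionDAG headers and that the rewrite happens where INTRINSIC_WO_CHAIN nodes are handled; the helper name and exact placement are illustrative, not the literal patch:

// Rewrite llvm.aarch64.neon.pmull64 into the new AArch64ISD::PMULL node,
// canonicalizing its scalar i64 operands into v1i64 vectors so that later
// combines can see lane information.
static SDValue lowerNeonPmull64(SDNode *N, SelectionDAG &DAG) {
  SDLoc DL(N);
  // For INTRINSIC_WO_CHAIN, operand 0 is the intrinsic ID; operands 1 and 2
  // are the two i64 inputs of the intrinsic.
  auto ToV1i64 = [&](SDValue V) {
    return DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, MVT::v1i64, V);
  };
  // pmull64 produces a 128-bit polynomial product, i.e. a v16i8 result.
  return DAG.getNode(AArch64ISD::PMULL, DL, MVT::v16i8,
                     ToV1i64(N->getOperand(1)), ToV1i64(N->getOperand(2)));
}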

Why:

  1. aarch64_neon_pmull64 is the motivating use case. Adding the SDNode makes it easier to canonicalize i64 inputs to vector inputs. Vector inputs carry lane information, which helps the DAG combiner combine nodes (e.g., rewrite to a better node to prepare for instruction selection); as a result, instruction selection can emit instructions that use higher-half inputs in place (i.e., no need to move lane 1 content to lane 0).
  2. For aarch64_neon_pmull, using the SDNode is an NFC; without it, we would have to move the definitions of {PMULLv1i64, PMULLv2i64} out of their current group of records for no gain.

Diff Detail

Event Timeline

mingmingl created this revision. Aug 2 2022, 11:25 PM
Herald added a project: Restricted Project. · View Herald Transcript · Aug 2 2022, 11:25 PM
mingmingl requested review of this revision. Aug 2 2022, 11:25 PM
Herald added a project: Restricted Project. · View Herald Transcript · Aug 2 2022, 11:25 PM
mingmingl edited the summary of this revision. (Show Details) · Aug 2 2022, 11:27 PM
mingmingl added reviewers: efriedma, dmgreen.

The problem with tablegen patterns that produce multiple values is that they don't allow the other nodes to combine as they naturally should. For example, in this case a dup(load) could be an ld1r. And I'm pretty sure a dup v1, x1 should be better than an 'fmov;dup' pair.

There is some code that already handles a dup of other mull instructions, including pmull, in tryCombineLongOpWithDup. It doesn't look like it gets called for a pmull64, though; could it be used here?

mingmingl planned changes to this revision. Aug 3 2022, 10:53 PM

The problem with tablegen patterns that produce multiple values is that they don't allow the other nodes to combine as they naturally should. For example, in this case a dup(load) could be an ld1r. And I'm pretty sure a dup v1, x1 should be better than an 'fmov;dup' pair.

There is some code that already handles a dup of other mull instructions, including pmull, in tryCombineLongOpWithDup. It doesn't look like it gets called for a pmull64, though; could it be used here?

Thanks for pointing out tryCombineLongOpWithDup (I wasn't aware of this approach before); it looks reasonable to me to follow the DAG-combiner path. Planning changes.

mingmingl edited the summary of this revision. (Show Details)

Updated the tablegen pattern to generate DUPv2i64gpr directly, with the motivation explained in the inline replies.

mingmingl added a comment. Edited · Aug 14 2022, 12:04 AM

The problem with tablegen patterns that produce multiple values is that they don't allow the other nodes to combine as they naturally should. For example, in this case a dup(load) could be an ld1r. And I'm pretty sure a dup v1, x1 should be better than an 'fmov;dup' pair.

There is some code that already handles a dup of other mull instructions, including pmull, in tryCombineLongOpWithDup. It doesn't look like it gets called for a pmull64, though; could it be used here?

Thanks for pointing out tryCombineLongOpWithDup (I wasn't aware of this approach before); it looks reasonable to me to follow the DAG-combiner path. Planning changes.

After a study (details elaborated below), I updated the tablegen pattern to use DUPv2i64gpr to dup from a GPR directly.

Some context on why the tablegen pattern was updated:

  1. Ultimately, we want dup V128, GPR64 as opposed to fmov V64, GPR (or dup V64, GPR) followed by a duplane64 V128, 0; so the tablegen pattern uses DUPv2i64gpr (dup from a general-purpose register to a SIMD register) to express the intention.
  2. Alternatively, a two-step approach works (a sketch of its first step follows this list):
    • The first step is to transform GPR64 into extractelt (duplane64 (scalar_to_vector V128, GPR64), 0), 1 (in AArch64ISelLowering.cpp) so pmull2 is used, and the second step is to transform (duplane64 (scalar_to_vector V128, GPR64), 0) into dup V128, GPR (in AArch64InstrInfo.td).
    • https://reviews.llvm.org/differential/diff/452482/ shows the two-step approach, with {cpp, td, test} changes. Note the tests are updated in the same way as this patch.
  3. Both this patch and the alternative option in 2 retain duplane64 throughout the legalize/combine steps, until instruction selection really begins. This is because extractelt(dup x) will be combined into x (code where this happens).
  4. tryCombineLongOpWithDup is suitable when the operands are vectors, but the intrinsic llvm.aarch64.neon.pmull64 has scalar operands.
    • Vectorizing a scalar operand during the DAG combiner is not preferred here, since new operands of a combined node won't be appended to the worklist for further combination (only the combined node and its users will be added to the worklist).
    • Even if new operands could be added to the worklist for further combination opportunities, we want to retain duplane64 (rather than combining it into dup), as mentioned in 3.
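For concreteness, here is a rough sketch of the "first step" of the two-step approach in 2, under the assumption that the operand arrives as a plain i64 in a GPR; the helper name is made up for illustration and this is not the literal diff:

// Express a scalar i64 pmull64 operand as lane 1 of a v2i64 value so that
// instruction selection can pick pmull2; a tablegen pattern would then turn
// duplane64(scalar_to_vector(x), 0) into dup(x) (the second step).
static SDValue widenPmull64Operand(SDValue Scalar, SelectionDAG &DAG,
                                   const SDLoc &DL) {
  SDValue Vec = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, MVT::v2i64, Scalar);
  SDValue Dup = DAG.getNode(AArch64ISD::DUPLANE64, DL, MVT::v2i64, Vec,
                            DAG.getConstant(0, DL, MVT::i64));
  return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i64, Dup,
                     DAG.getConstant(1, DL, MVT::i64));
}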

(Sorry for the delay; I did some study for the findings and then took a one-week vacation)

mingmingl retitled this revision from [AArch64] Add a tablegen pattern for aarch64.neon.pmull64 to [AArch64] Add a tablegen pattern to transform duplane(scalar_to_vector(x),0) to dup(x), and vectorize scalar operands for aarch64.neon.pmull64 intrinsic.
mingmingl edited the summary of this revision. (Show Details)

Decided to go with a tablegen pattern for dup, and vectorization of scalar operands for the aarch64.neon.pmull64 intrinsic.

  • The alternative is a tablegen pattern for the intrinsic [1]. This alternative is sub-optimal since it creates new nodes (to dup from a GPR) in the final instruction-selection stage, missing chances of combination in the DAG combiner.
    • For example, if the i64 is the lower half of a SIMD register, we want to dup from the lane directly rather than generating an fmov (from SIMD lane 0 to a GPR) followed by a dup (from the GPR to all lanes of the SIMD register). test2 in aarch64-pmull2 is the corresponding test for this.

[1]

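// Alternative [1]: select the intrinsic directly, duplicating the GPR operand
// with DUPv2i64gpr so that PMULLv2i64 (pmull2) can consume it.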
def : Pat<(int_aarch64_neon_pmull64 (extractelt (v2i64 V128:$Rn), (i64 1)),
                                    GPR64:$Rm),
          (PMULLv2i64 V128:$Rn, (v2i64 (DUPv2i64gpr GPR64:$Rm)))>;

Can you upload with full context?

If this is relying on DUPLANE(SCALAR_TO_VEC) not simplifying into DUP, that might not always be true as more optimizations are added in the future. It may be better to canonicalize all PMULL64 to use v1i64 vectors, and use EXTRACT_SUBVECTOR from vectors. It may require adding a new PMULL64 node if the intrinsics require i64 inputs, but it should allow ld1r to be created as opposed to load+dup.
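As a rough illustration of that suggestion (assuming DAG, DL, and a v2i64 value Vec whose lane 1 holds the input are in scope), the canonical operand form would be something like:

// Keep the high half of Vec as a v1i64 instead of extracting lane 1 to a
// scalar i64, so the value stays in the vector register file and the
// instruction selector can use the register's high half in place.
SDValue HighHalf = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v1i64, Vec,
                               DAG.getConstant(1, DL, MVT::i64));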

mingmingl planned changes to this revision. Aug 15 2022, 10:02 PM

Can you upload with full context?

Ah, I should have made it more explicit that the context above was all my findings.

If this is relying on DUPLANE(SCALAR_TO_VEC) not simplifying into DUP, that might not always be true as more optimizations are added in the future.

I didn't realize the change relies on the duplane(scalar_to_vector) -> dup combine being missed. Thanks for the catch!

It may be better to canonicalize all PMULL64 to use v1i64 vectors, and use EXTRACT_SUBVECTOR from vectors. It may require adding a new PMULL64 node if the intrinsics require i64 inputs, but it should allow ld1r to be created as opposed to load+dup.

Re "canonicalize all PMULL64 to use v1i64 and adding a PMULL64 node" does this mean adding a aarch64-specific SelectionDAG node for PMULL64 (llvm doc), like what SMULL/UMULL does? This sounds promising and I will make a change. Let me know if I misunderstand anything, thanks for all the guidance!

Re "aarch64_neon_pmull64 relies on i64 inputs": yeah, this means [1] won't compile, and [2] would require a v1i64 -> i64 change, which is not desired (details in [3]).

[1]

// with "Type set is empty for each HW mode" error from tablegen 
def : Pat<(int_aarch64_neon_pmull64 (v1i64 (extract_subvector V128:$Rn, (i64 1))),
                                    (v1i64 (extract_subvector V128:$Rm, (i64 1)))),
                (PMULLv2i64 V128:$Rn, V128:$Rm)>

[2]

def : Pat<(int_aarch64_neon_pmull64 (i64 (bitconvert (v1i64 (extract_subvector V128:$Rn, (i64 1))))),
                                    (i64 (bitconvert (v1i64 (extract_subvector V128:$Rm, (i64 1)))))),
          (PMULLv2i64 V128:$Rn, V128:$Rm)>;

[3]
First, v1i64 -> i64 reverses i64 -> v1i64, so having both of them either creates an indefinite "combine -> legalize -> combine" loop (nodes created by the DAG combiner will be re-legalized), or makes the steps very stateful and complex.
Secondly, PreprocessISelDAG, the per-target hook that runs right before ISel begins, isn't implemented in the AArch64 backend, and implementing it (for this use case) comes with a non-negligible cost of iterating over all instructions.
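A hypothetical sketch of what that hook would have to look like, purely to illustrate the cost argument (the AArch64 backend does not implement PreprocessISelDAG, and this is not proposed code):

// Hypothetical only: every node in the DAG would have to be visited just to
// find the rare pmull64 intrinsic calls.
void AArch64DAGToDAGISel::PreprocessISelDAG() {
  for (SDNode &N : CurDAG->allnodes()) {
    if (N.getOpcode() != ISD::INTRINSIC_WO_CHAIN)
      continue;
    // ... rewrite the operands of llvm.aarch64.neon.pmull64 here ...
  }
}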

Can you upload with full context?

Ah, I should have made it more explicit that the context above was all my findings.

I meant update the patch with -U9999999 as per https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface. It makes the reviews easier to read.

If this is relying on DUPLANE(SCALAR_TO_VEC) not simplifying into DUP, that might not always be true as more optimizations are added in the future.

I didn't realize the change relies on the duplane(scalar_to_vector) -> dup combine being missed. Thanks for the catch!

It may be better to canonicalize all PMULL64 to use v1i64 vectors, and use EXTRACT_SUBVECTOR from vectors. It may require adding a new PMULL64 node if the intrinsics require i64 inputs, but it should allow ld1r to be created as opposed to load+dup.

Re "canonicalize all PMULL64 to use v1i64 and adding a PMULL64 node" does this mean adding a aarch64-specific SelectionDAG node for PMULL64 (llvm doc), like what SMULL/UMULL does? This sounds promising and I will make a change. Let me know if I misunderstand anything, thanks for all the guidance!

Yeah, an AArch64ISD::PMULL64 node. It can hopefully make the intent that the inputs are i64s in vector registers easier to specify.

mingmingl retitled this revision from [AArch64] Add a tablegen pattern to transform duplane(scalar_to_vector(x),0) to dup(x), and vectorize scalar operands for aarch64.neon.pmull64 intrinsic to [AArch64] Change aarch64_neon_pmull{,64} intrinsic ISel through a new SDNode..
mingmingl edited the summary of this revision. (Show Details)

Add a pmull SDNode and use it for both aarch64_neon_pmull and aarch64_neon_pmull64.

Can you upload with full context?

Ah, I should have made it more explicit that the context above was all my findings.

I meant update the patch with -U9999999 as per https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface. It makes the reviews easier to read.

Got it. Used "git show HEAD -U999999 > mypatch.patch" this time.

If this is relying on DUPLANE(SCALAR_TO_VEC) not simplifying into DUP, that might not always be true as more optimizations are added in the future.

I didn't realize the change relies on the duplane(scalar_to_vector) -> dup combine being missed. Thanks for the catch!

It may be better to canonicalize all PMULL64 to use v1i64 vectors, and use EXTRACT_SUBVECTOR from vectors. It may require adding a new PMULL64 node if the intrinsics require i64 inputs, but it should allow ld1r to be created as opposed to load+dup.

Re "canonicalize all PMULL64 to use v1i64 and adding a PMULL64 node" does this mean adding a aarch64-specific SelectionDAG node for PMULL64 (llvm doc), like what SMULL/UMULL does? This sounds promising and I will make a change. Let me know if I misunderstand anything, thanks for all the guidance!

Yeah, an AArch64ISD::PMULL64 node. It can hopefully make the intent that the inputs are i64s in vector registers easier to specify.

Done, with two things to mention:

  1. When only one operand is a higher half, the other operand is legalized with DUP.
    • If the other operand is a lower half, DUPLANE64 is better. I added a FIXME, assuming aarch64_neon_pmull64(higher-half, lower-half) is not common. This is feasible to fix (e.g., using Optional<uint64_t> to represent lane 1, lane 0, or not a SIMD operand; see the sketch after this list) but just adds a little more code complexity.
  2. It seems to me that ld1r requires the base address to be in a GPR, so an address with an offset becomes another add instruction, while with ldr (small) offsets can be folded into the instruction itself. So I use ISD::SCALAR_TO_VECTOR for i64 -> v1i64 rather than AArch64ISD::DUP.
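A sketch of the lane-detection idea from 1 above, with an assumed helper name and std::optional standing in for Optional; illustrative only, not part of the patch:

// Classify an i64 operand of pmull64: lane 1 of a v2i64 (higher half),
// lane 0 (lower half), or not extracted from a SIMD register at all.
static std::optional<uint64_t> getExtractedLane(SDValue Op) {
  if (Op.getOpcode() != ISD::EXTRACT_VECTOR_ELT ||
      Op.getOperand(0).getValueType() != MVT::v2i64)
    return std::nullopt;
  if (auto *Idx = dyn_cast<ConstantSDNode>(Op.getOperand(1)))
    return Idx->getZExtValue(); // 0 -> use DUPLANE64; 1 -> feed pmull2.
  return std::nullopt;
}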

Let me know if I missed anything in 1) or 2). Thanks!

mingmingl updated this revision to Diff 453229. Aug 17 2022, 1:17 AM

Updated after the precommit tests were simplified so that the diff is clearer.

dmgreen added inline comments. Aug 17 2022, 9:47 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4580–4584

Return an AArch64ISD::PMULL here...

4588

Will the type ever not be an i64?

16644–16645

.. then I don't think this is needed; we have already "lowered" pmull64.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
673

Add a newline like the others below to keep the line length down. I don't think there is a strict line length in these files, but we try to keep lines from getting too long.

mingmingl marked 4 inline comments as done.
mingmingl edited the summary of this revision. (Show Details)

Updates include:

  1. addressed comments
  2. took the liberty to detect both lane 1 and lane 0 (previously only lane 1), so duplane64 (as opposed to dup-from-GPR) is used for lane 0, as a small optimization.

Also, at this point test4 in pmull-ldr-merge.ll is not about ldr, so aarch64-pmull.ll is a better place for it. I could send an NFC change later, so as not to frequently update the precommit test patch (an NFC update to the precommit test is not obvious in the UI).

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4588

By lowering aarch64_neon_pmull64 to AArch64ISD::PMULL around line 4583 below (as suggested), the type would always be i64, so I added an assert. Without the intrinsic -> AArch64ISD::PMULL lowering, the new intrinsic SDNode would be added to the node list of the DAG (code) for legalization, and by then the type is not i64.
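Roughly what the added assert expresses, in illustrative wording (Op stands for one of the intrinsic's operands; the exact message in the patch may differ):

assert(Op.getValueType() == MVT::i64 &&
       "aarch64_neon_pmull64 operand should be i64 once the intrinsic is "
       "lowered to AArch64ISD::PMULL before legalization");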

mingmingl updated this revision to Diff 453798. Aug 18 2022, 2:52 PM

Changes

  1. Fix a subtle C++ bug in the static lambda TryVectorizeOperand: the helper function is declared as a lambda to limit its scope (no need to sanity-check parameters), but it should really not capture variables that could change per invocation. This issue just occurred to me when looking at the codebase (a standalone illustration follows this list).
  2. In the tablegen class SIMDDifferentThreeVectorBD, remove the default parameter (null) for OpNode since there isn't a use case for a default parameter now.
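A standalone illustration of the C++ pitfall described in 1, independent of the patch (hypothetical names):

#include <cassert>

// A function-local `static` lambda is constructed exactly once, so any
// reference it captures keeps pointing at the first call's argument; on
// later calls that reference is dangling.
static int addCaptured(int X) {
  static auto AddX = [&X](int Y) { return X + Y; };
  return AddX(1);
}

int main() {
  assert(addCaptured(10) == 11); // First call behaves as expected.
  // addCaptured(20) would still read the first call's (now destroyed) X:
  // undefined behavior, the same class of bug described above.
  return 0;
}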
dmgreen accepted this revision. Aug 19 2022, 2:06 AM

Thanks. LGTM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4558

Remove static from this variable. I'm pretty sure it only really applies to the TryVectorizeOperand variable, not anything to do with the lambda, so it needn't be static.

This revision is now accepted and ready to land. Aug 19 2022, 2:06 AM
mingmingl marked an inline comment as done.

Resolve comments.

Thanks. LGTM

Thanks for consistently steering the patch in the right direction!

When it came to commit time, I learned that I should have asked for review of D131045 in the first place. Better late than never, so I did that just now.

This revision was landed with ongoing or failed builds. Aug 19 2022, 1:18 PM
This revision was automatically updated to reflect the committed changes.