This is an archive of the discontinued LLVM Phabricator instance.

Differential D131047

[AArch64] Change aarch64_neon_pmull{,64} intrinsic ISel through a new SDNode.
ClosedPublic

Authored by mingmingl on Aug 2 2022, 11:25 PM.

Details

Reviewers

efriedma
dmgreen

Commits

rG945a3065015a: [AArch64] Change aarch64_neon_pmull{,64} intrinsic ISel through a new

Summary

How

- Add the AArch64ISD::PMULL SDNode, and extend the aarch64_neon_pmull intrinsic tablegen pattern for this SDNode.
- For aarch64_neon_pmull64, canonicalize i64 operands to v1i64 vectors during legalization.
- For {aarch64_neon_pmull, aarch64_neon_pmull64}, combine the intrinsic into the SDNode in the DAG combiner.

Why

aarch64_neon_pmull64 is the motivating use case. Adding the SDNode makes it easier to canonicalize i64 inputs to vector inputs. Vector inputs carry lane information, which helps the DAG combiner combine nodes (e.g., rewrite to a better node in preparation for instruction selection); as a result, instruction selection can emit instructions that use higher-half inputs in place (i.e., there is no need to move lane 1 content to lane 0).
For aarch64_neon_pmull, using the SDNode is NFC, yet without it we would have to move the definitions of {PMULLv1i64, PMULLv2i64} out of their current group of records for no gain.
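For readers unfamiliar with the instruction, the following is a minimal software sketch of what a 64-bit polynomial (carry-less) multiply computes, and of why the lane matters: the low form reads lane 0 of each source register, the high form (pmull2) reads lane 1 in place. The names Poly128, clmul64, V2I64 and pmullLane are hypothetical and exist only for this illustration; none of this is code from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// 128-bit product, split into low and high 64-bit halves (illustrative type).
struct Poly128 {
  uint64_t lo;
  uint64_t hi;
};

// Carry-less multiply: like schoolbook multiplication, but partial products
// are combined with XOR instead of addition (arithmetic over GF(2)).
inline Poly128 clmul64(uint64_t a, uint64_t b) {
  Poly128 r{0, 0};
  for (unsigned i = 0; i < 64; ++i) {
    if ((b >> i) & 1) {
      r.lo ^= a << i;
      if (i != 0)
        r.hi ^= a >> (64 - i); // bits of `a` shifted past bit 63
    }
  }
  return r;
}

// A 128-bit SIMD register viewed as two 64-bit lanes.
using V2I64 = std::array<uint64_t, 2>;

// lane == 0 models the low form (pmull); lane == 1 models the high form
// (pmull2), which consumes the upper halves without moving them to lane 0.
inline Poly128 pmullLane(const V2I64 &n, const V2I64 &m, unsigned lane) {
  assert(lane < 2 && "a v2i64 only has lanes 0 and 1");
  return clmul64(n[lane], m[lane]);
}
```

In this model, (x + 1) * (x + 1) = x^2 + 1 over GF(2), so clmul64(3, 3) has lo == 5 and hi == 0.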

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

mingmingl created this revision. Aug 2 2022, 11:25 PM

Herald added a project: Restricted Project. Aug 2 2022, 11:25 PM

Herald added subscribers: hiraditya, kristof.beyls.

mingmingl requested review of this revision. Aug 2 2022, 11:25 PM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B178950: Diff 449552. Aug 2 2022, 11:26 PM

mingmingl edited the summary of this revision. Aug 2 2022, 11:27 PM

mingmingl added reviewers: efriedma, dmgreen.

The problem with tablegen patterns that produce multiple values is that they don't allow the other nodes to combine as they naturally should. For example in this case a dup(load) could be a ld1r. And I'm pretty sure a dup v1, x1 should be better than a 'fmov;dup' pair.

There is some code that already handles dup of other mull instructions, including pmull in tryCombineLongOpWithDup. It doesn't look like it gets called for a pmull64 though, could it be used here?

mingmingl mentioned this in D130100: [AArch64] Combine a load into GPR followed by a copy to FPR to a load into FPR directly through MIPeepholeOpt. Aug 3 2022, 1:53 PM

In D131047#3696069, @dmgreen wrote:

The problem with tablegen patterns that produce multiple values is that they don't allow the other nodes to combine as they naturally should. For example in this case a dup(load) could be a ld1r. And I'm pretty sure a dup v1, x1 should be better than a 'fmov;dup' pair.

There is some code that already handles dup of other mull instructions, including pmull in tryCombineLongOpWithDup. It doesn't look like it gets called for a pmull64 though, could it be used here?

Thanks for pointing out tryCombineLongOpWithDup (wasn't aware of this approach before); it looks reasonable to me to follow the dag-combiner path. Plan changes.

Updated the tablegen pattern to generate DUPv2i64gpr directly, with the motivation in inline replies.

After some study (details elaborated below), I updated the tablegen pattern to use DUPv2i64gpr to dup from a GPR directly.

Some context on why the tablegen pattern is updated:

1. Ultimately, we want dup V128, GPR64 as opposed to fmov V64 GPR (or dup V64 GPR) followed by a duplane64 V128, 0; so the tablegen pattern uses DUPv2i64gpr (dup from main to SIMD register) to express the intention.
2. Alternatively, this two-step approach works:
   - The first step is to transform GPR64 to extractelt (duplane64 (scalar_to_vector V128, GPR64), 0), 1 (in AArch64ISelLowering.cpp) so that pmull2 is used, and the second step is to transform (duplane64 (scalar_to_vector V128, GPR64), 0) to dup V128, GPR (in AArch64InstrInfo.td).
   - https://reviews.llvm.org/differential/diff/452482/ shows the two-step approach, with {cpp, td, test} changes. Note the tests are updated in the same way as in this patch.
3. Both this patch and the alternative option in 2 retain duplane64 throughout the legalize/combine steps, until instruction selection really begins. This is because extractelt(dup x) will be combined into x (code where this happens).
4. tryCombineLongOpWithDup is suitable when the operands are vectors, but the intrinsic llvm.aarch64.neon.pmull64 has scalar operands.
   - Vectorizing a scalar operand in the DAG combiner is not preferred here, since new operands of a combined node won't be appended to the worklist for further combination (only the combined node and its users will be added to the worklist).
   - Even if new operands could be added to the worklist for further combination opportunities, we want to retain duplane64 (rather than combining it into dup) as mentioned in 3.

(Sorry for the delay; I did some study for the findings and then took a one-week vacation)

Harbormaster completed remote builds in B181138: Diff 452482. Aug 14 2022, 1:22 AM

Decided to go with a tablegen pattern for dup, and vectorization for the aarch64.neon.pmull64 intrinsic.

The alternative is a tablegen pattern for the intrinsic [1]. This alternative is sub-optimal since it creates new nodes (to dup from GPR) in the final instruction-selection stage, and misses chances of combination in the DAG combiner.
- For example, if the i64 is the lower half of a SIMD register, we want to dup from the lane directly, rather than generating an fmov (from SIMD lane 0 to GPR) followed by a dup (from GPR to all lanes of SIMD). test2 in aarch64-pmull2 is the corresponding test for this.

[1]

def : Pat<(int_aarch64_neon_pmull64 (extractelt (v2i64 V128:$Rn), (i64 1)),
                                    GPR64:$Rm),
          (PMULLv2i64 V128:$Rn, (v2i64 (DUPv2i64gpr GPR64:$Rm)))>;

Harbormaster completed remote builds in B181193: Diff 452557. Aug 14 2022, 12:50 PM

Can you upload with full context?

If this is relying on DUPLANE(SCALAR_TO_VEC) not simplifying into DUP, that might not always be true as more optimizations are added in the future. It may be better to canonicalize all PMULL64 to use v1i64 vectors, and use EXTRACT_SUBVECTOR from vectors. It may require adding a new PMULL64 node if the intrinsics require i64 inputs, but it should allow ld1r to be created as opposed to load+dup.

In D131047#3723365, @dmgreen wrote:

Can you upload with full context?

Ah, I should have made it more explicit that the context above was all my findings.

If this is relying on DUPLANE(SCALAR_TO_VEC) not simplifying into DUP, that might not always be true as more optimizations are added in the future.

I didn't realize the change relies on the miss of duplane(scalar_to_vector)->dup. Thanks for the catch!

It may be better to canonicalize all PMULL64 to use v1i64 vectors, and use EXTRACT_SUBVECTOR from vectors. It may require adding a new PMULL64 node if the intrinsics require i64 inputs, but it should allow ld1r to be created as opposed to load+dup.

Re "canonicalize all PMULL64 to use v1i64 and adding a PMULL64 node": does this mean adding an aarch64-specific SelectionDAG node for PMULL64 (llvm doc), like what SMULL/UMULL does? This sounds promising and I will make a change. Let me know if I misunderstand anything, thanks for all the guidance!

Re aarch64_neon_pmull64 relies on i64 inputs, yeah this means [1] won't compile, and [2] will require a v1i64 -> i64 change, which is not desired (details in [3])

[1]

// with "Type set is empty for each HW mode" error from tablegen
def : Pat<(int_aarch64_neon_pmull64 (v1i64 (extract_subvector V128:$Rn, (i64 1))),
                                    (v1i64 (extract_subvector V128:$Rm, (i64 1)))),
          (PMULLv2i64 V128:$Rn, V128:$Rm)>;

[2]

def : Pat<(int_aarch64_neon_pmull64 (i64 (bitconvert (v1i64 (extract_subvector V128:$Rn, (i64 1))))),
                                    (i64 (bitconvert (v1i64 (extract_subvector V128:$Rm, (i64 1)))))),
          (PMULLv2i64 V128:$Rn, V128:$Rm)>;

[3]
First, v1i64->i64 reverses i64->v1i64, so having both of them either gives an infinite "combine->legalization->combine" loop (nodes created by the DAG combiner will be re-legalized), or makes the steps super stateful and complex.
Secondly, PreprocessISelDAG, the per-target hook right before ISel begins, isn't implemented in the aarch64 backend. Implementing it (for this use case) comes with a non-negligible cost of iterating over all instructions.

In D131047#3725122, @mingmingl wrote:

Ah, I should have made it more explicit that the context above was all my findings.

I meant update the patch with -U9999999 as per https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface. It makes the reviews easier to read.

Re "canonicalize all PMULL64 to use v1i64 and adding a PMULL64 node": does this mean adding an aarch64-specific SelectionDAG node for PMULL64 (llvm doc), like what SMULL/UMULL does? This sounds promising and I will make a change. Let me know if I misunderstand anything, thanks for all the guidance!

Yeah, an AArch64ISD::PMULL64 node. It can hopefully make the intent that the inputs are i64's in vector registers easier to specify.

Add SDNode pmull and use it for both aarch64_neon_pmull and aarch64_neon_pmull64.

In D131047#3725210, @dmgreen wrote:

I meant update the patch with -U9999999 as per https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface. It makes the reviews easier to read.

Got it. Used "git show HEAD -U999999 > mypatch.patch" this time.

Yeah, an AArch64ISD::PMULL64 node. It can hopefully make the intent that the inputs are i64's in vector registers easier to specify.

Done, with two things to mention:

1. When only one operand is a higher half, the other operand is legalized with DUP.
   - If the other operand is a lower half, DUPLANE64 is better. I used a FIXME, assuming aarch64_neon_pmull64(higher-half, lower-half) is not common. This is feasible to fix (using Optional<uint64_t> to represent lane 1, lane 0 or not SIMD) but adds a little more code complexity.
2. It seems to me that ld1r requires the base address to be a GPR, so an address with an offset needs another add instruction, while with ldr (small) offsets can be folded into the instruction itself. So I used ISD::SCALAR_TO_VECTOR for i64->v1i64 rather than ISD::AArch64Dup.

Let me know if I miss anything in 1) or 2). Thanks!

Harbormaster completed remote builds in B181692: Diff 453214. Aug 17 2022, 12:56 AM

Update after precommit tests are simplified so that diff is clearer.

Harbormaster completed remote builds in B181698: Diff 453229. Aug 17 2022, 1:18 AM

dmgreen added inline comments. Aug 17 2022, 9:47 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4614–4616	Return a AArch64ISD::PMULL here...
4634	Will the type ever not be a i64?
16696–16697	.. then I don't think this is needed, we have already "lowered" pmull64.
llvm/lib/Target/AArch64/AArch64InstrInfo.td
674	Add a newline like the others below to keep the line-length down. I don't think there is a strict line length in these files, but we try to keep the lines from getting too long.

Updates include:

- Address comments.
- Took the liberty to detect both lane 1 and lane 0 (previously only lane 1), so that duplane64 (as opposed to dup-from-gpr) is used for lane 0, as a small optimization.

Also, at this point test4 in pmull-ldr-merge.ll is not about ldr, so aarch64-pmull.ll is a better place. I could send an NFC change later, so as not to frequently update the precommit test patch (an NFC update to a precommit test is not obvious in the UI).
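The per-operand decision described in the comments above (lane 1, lane 0, or not an extract from a two-element vector) can be sketched as a small standalone function. This is a simplified model under stated assumptions, not the patch itself: it uses std::optional in place of llvm::Optional, and the enum OperandRewrite and the function classifyOperand are hypothetical names that only label which DAG node the real lambda would build.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical labels for the node an operand is rewritten to.
enum class OperandRewrite {
  ExtractHigh,            // extract_subvector(v, 1)
  DupLaneThenExtractHigh, // extract_subvector(duplane64(v, 0), 1)
  DupFromScalar,          // AArch64ISD::DUP of the i64 value
  ScalarToVector          // ISD::SCALAR_TO_VECTOR (i64 -> v1i64)
};

// Lane the operand is extracted from, or nullopt when it is not an
// extractelt with a constant lane from a two-element vector.
using Lane = std::optional<uint64_t>;

// Decide how to vectorize operand N given its own lane and the lane of the
// other operand of the multiply.
inline OperandRewrite classifyOperand(Lane nLane, Lane otherLane) {
  assert((!nLane || *nLane < 2) && "expect lane to be none, 0 or 1");
  assert((!otherLane || *otherLane < 2) && "expect lane to be none, 0 or 1");

  // N itself is a higher half: use it in place.
  if (nLane && *nLane == 1)
    return OperandRewrite::ExtractHigh;

  // N is not a higher half but the other operand is: bring N's value into
  // lane 1 so that both operands line up for the high form.
  if (otherLane && *otherLane == 1) {
    if (nLane && *nLane == 0)
      return OperandRewrite::DupLaneThenExtractHigh;
    return OperandRewrite::DupFromScalar;
  }

  // Neither operand is a higher half: the low form is enough.
  return OperandRewrite::ScalarToVector;
}
```

For instance, classifyOperand(Lane{0}, Lane{1}) yields DupLaneThenExtractHigh, which corresponds to the lane 0 case added in this update, while classifyOperand(std::nullopt, Lane{1}) yields DupFromScalar.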

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4634	By lowering `aarch64_neon_pmull64` to `AArch64ISD::PMULL` around line 4583 below (as suggested), the type would always be i64. So added an assert -> without `intrinsic->AArch64ISD::PMULL` lowering, the new intrinsic SDNode will be added to node list of the DAG (code) for legalization; by then the type is not i64.

Harbormaster completed remote builds in B181916: Diff 453543. Aug 18 2022, 12:04 AM

Changes:

- Fix a subtle C++ bug in the static lambda TryVectorizeOperand: the helper function is declared as a lambda to limit its scope (no need to sanity-check parameters), but it should not capture variables (which could change per invocation). This issue occurred to me when looking at the codebase.
- In the tablegen multiclass SIMDDifferentThreeVectorBD, remove the default parameter (null) for OpNode, since there is no use case for a default parameter now.
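The pitfall with a capturing static lambda can be reproduced in isolation. The sketch below is a generic illustration with hypothetical function names, not the code from the patch; it captures by value so that the behaviour stays well-defined (a by-reference capture in the same position would leave a dangling reference after the first call returns).

```cpp
#include <cassert>

// Pitfall: a function-local static is initialized exactly once, on the first
// call. The closure therefore keeps the `factor` seen by that first call and
// silently ignores the argument on every later call.
inline int scaleWithStaticLambda(int factor, int x) {
  static auto scale = [factor](int v) { return v * factor; };
  return scale(x);
}

// Fix: drop `static` and the capture list; pass everything the helper needs
// as explicit parameters, so each invocation sees its own arguments.
inline int scaleWithPlainLambda(int factor, int x) {
  auto scale = [](int f, int v) { return v * f; };
  return scale(factor, x);
}
```

Calling scaleWithStaticLambda(2, 10) and then scaleWithStaticLambda(3, 10) returns 20 both times, whereas scaleWithPlainLambda returns 20 and 30.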

Harbormaster completed remote builds in B182096: Diff 453798. Aug 18 2022, 2:52 PM

Thanks. LGTM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4576	Remove static from this variable. I'm pretty sure it only really applies to the TryVectorizeOperand variable, not anything to do with the lambda, so needn't be static.

This revision is now accepted and ready to land. Aug 19 2022, 2:06 AM

mingmingl mentioned this in D131045: [NFC][AArch64] Precommit test to optimize instruction selection for aarch64_neon_pmull64 intrinsic. Aug 19 2022, 10:14 AM

Resolve comments.

In D131047#3734573, @dmgreen wrote:

Thanks. LGTM

Thanks for consistently steering the patch in the right direction!

I learnt I should have asked for review of D131045 in the first place when it came to commit time. It's better late than never, so did that just now.

Harbormaster completed remote builds in B182262: Diff 454056. Aug 19 2022, 10:32 AM

This revision was landed with ongoing or failed builds. Aug 19 2022, 1:18 PM

Closed by commit rG945a3065015a: [AArch64] Change aarch64_neon_pmull{,64} intrinsic ISel through a new (authored by mingmingl).

This revision was automatically updated to reflect the committed changes.

mingmingl added a commit: rG945a3065015a: [AArch64] Change aarch64_neon_pmull{,64} intrinsic ISel through a new.

Allen mentioned this in D140649: [AArch64][SelectionDAG] Eliminates redundant zero-extension for 32-bit popcount. Dec 24 2022, 12:32 AM

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.h	2 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	91 lines
llvm/lib/Target/AArch64/AArch64InstrFormats.td	15 lines
llvm/lib/Target/AArch64/AArch64InstrInfo.td	11 lines
llvm/test/CodeGen/AArch64/aarch64-pmull2.ll	31 lines
llvm/test/CodeGen/AArch64/pmull-ldr-merge.ll	13 lines

Diff 454094

llvm/lib/Target/AArch64/AArch64ISelLowering.h

@@ ... @@ enum NodeType : unsigned {
   /// mode without emitting such REV instructions.
   NVCAST,

   MRS, // MRS, also sets the flags via a glue.

   SMULL,
   UMULL,

+  PMULL,
+
   // Reciprocal estimates and steps.
   FRECPE,
   FRECPS,
   FRSQRTE,
   FRSQRTS,

   SUNPKHI,
   SUNPKLO,

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

@@ ... @@ case AArch64ISD::FIRST_NUMBER:
     MAKE_CASE(AArch64ISD::LD2LANEpost)
     MAKE_CASE(AArch64ISD::LD3LANEpost)
     MAKE_CASE(AArch64ISD::LD4LANEpost)
     MAKE_CASE(AArch64ISD::ST2LANEpost)
     MAKE_CASE(AArch64ISD::ST3LANEpost)
     MAKE_CASE(AArch64ISD::ST4LANEpost)
     MAKE_CASE(AArch64ISD::SMULL)
     MAKE_CASE(AArch64ISD::UMULL)
+    MAKE_CASE(AArch64ISD::PMULL)
     MAKE_CASE(AArch64ISD::FRECPE)
     MAKE_CASE(AArch64ISD::FRECPS)
     MAKE_CASE(AArch64ISD::FRSQRTE)
     MAKE_CASE(AArch64ISD::FRSQRTS)
     MAKE_CASE(AArch64ISD::STG)
     MAKE_CASE(AArch64ISD::STZG)
     MAKE_CASE(AArch64ISD::ST2G)
     MAKE_CASE(AArch64ISD::STZ2G)
@@ ... @@ if (OrigTy.getSizeInBits() >= 64)
     return N;

   // Must extend size to at least 64 bits to be used as an operand for VMULL.
   EVT NewVT = getExtensionTo64Bits(OrigTy);

   return DAG.getNode(ExtOpcode, SDLoc(N), NewVT, N);
 }

-static bool isOperandOfHigherHalf(SDValue &Op) {
+// Returns lane if Op extracts from a two-element vector and lane is constant
+// (i.e., extractelt(<2 x Ty> %v, ConstantLane)), and None otherwise.
+static Optional<uint64_t> getConstantLaneNumOfExtractHalfOperand(SDValue &Op) {
   SDNode *OpNode = Op.getNode();
   if (OpNode->getOpcode() != ISD::EXTRACT_VECTOR_ELT)
-    return false;
+    return None;

-  ConstantSDNode *C = dyn_cast<ConstantSDNode>(OpNode->getOperand(1));
-  if (!C || C->getZExtValue() != 1)
-    return false;
-
   EVT VT = OpNode->getOperand(0).getValueType();
+  ConstantSDNode *C = dyn_cast<ConstantSDNode>(OpNode->getOperand(1));
+  if (!VT.isFixedLengthVector() || VT.getVectorNumElements() != 2 || !C)
+    return None;

-  return VT.isFixedLengthVector() && VT.getVectorNumElements() == 2;
-}
-
-static bool areOperandsOfHigherHalf(SDValue &Op1, SDValue &Op2) {
-  return isOperandOfHigherHalf(Op1) && isOperandOfHigherHalf(Op2);
+  return C->getZExtValue();
 }

 static bool isExtendedBUILD_VECTOR(SDNode *N, SelectionDAG &DAG,
                                    bool isSigned) {
   EVT VT = N->getValueType(0);

   if (N->getOpcode() != ISD::BUILD_VECTOR)
     return false;
@@ ... @@ if (Ty == MVT::i64) {
       return DAG.getNode(ISD::BITCAST, dl, MVT::i64, Result);
     } else if (Ty.isVector() && Ty.isInteger() && isTypeLegal(Ty)) {
       return DAG.getNode(ISD::ABS, dl, Ty, Op.getOperand(1));
     } else {
       report_fatal_error("Unexpected type for AArch64 NEON intrinic");
     }
   }
   case Intrinsic::aarch64_neon_pmull64: {
-    SDValue Op1 = Op.getOperand(1);
-    SDValue Op2 = Op.getOperand(2);
-
-    // If both operands are higher half of two source SIMD & FP registers,
-    // ISel could make use of tablegen patterns to emit PMULL2. So do not
-    // legalize i64 to v1i64.
-    if (areOperandsOfHigherHalf(Op1, Op2))
-      return SDValue();
-
-    // As a general convention, use "v1" types to represent scalar integer
-    // operations in vector registers. This helps ISel to make use of
-    // tablegen patterns and generate a load into SIMD & FP registers directly.
-    if (Op1.getValueType() == MVT::i64)
-      Op1 = DAG.getNode(ISD::BITCAST, dl, MVT::v1i64, Op1);
-    if (Op2.getValueType() == MVT::i64)
-      Op2 = DAG.getNode(ISD::BITCAST, dl, MVT::v1i64, Op2);
-
-    return DAG.getNode(
-        ISD::INTRINSIC_WO_CHAIN, dl, Op.getValueType(),
-        DAG.getConstant(Intrinsic::aarch64_neon_pmull64, dl, MVT::i32), Op1,
-        Op2);
+    SDValue LHS = Op.getOperand(1);
+    SDValue RHS = Op.getOperand(2);
+
+    Optional<uint64_t> LHSLane = getConstantLaneNumOfExtractHalfOperand(LHS);
+    Optional<uint64_t> RHSLane = getConstantLaneNumOfExtractHalfOperand(RHS);
+
+    assert((!LHSLane || *LHSLane < 2) && "Expect lane to be None or 0 or 1");
+    assert((!RHSLane || *RHSLane < 2) && "Expect lane to be None or 0 or 1");
+
+    // 'aarch64_neon_pmull64' takes i64 parameters; while pmull/pmull2
+    // instructions execute on SIMD registers. So canonicalize i64 to v1i64,
+    // which ISel recognizes better. For example, generate a ldr into d*
+    // registers as opposed to a GPR load followed by a fmov.
+    auto TryVectorizeOperand =
+        [](SDValue N, Optional<uint64_t> NLane, Optional<uint64_t> OtherLane,
+           const SDLoc &dl, SelectionDAG &DAG) -> SDValue {
+      // If the operand is an higher half itself, rewrite it to
+      // extract_high_v2i64; this way aarch64_neon_pmull64 could
+      // re-use the dag-combiner function with aarch64_neon_{pmull,smull,umull}.
+      if (NLane && *NLane == 1)
+        return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v1i64,
+                           N.getOperand(0), DAG.getConstant(1, dl, MVT::i64));
+
+      // Operand N is not a higher half but the other operand is.
+      if (OtherLane && *OtherLane == 1) {
+        // If this operand is a lower half, rewrite it to
+        // extract_high_v2i64(duplane(<2 x Ty>, 0)). This saves a roundtrip to
+        // align lanes of two operands. A roundtrip sequence (to move from lane
+        // 1 to lane 0) is like this:
+        // mov x8, v0.d[1]
+        // fmov d0, x8
+        if (NLane && *NLane == 0)
+          return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v1i64,
+                             DAG.getNode(AArch64ISD::DUPLANE64, dl, MVT::v2i64,
+                                         N.getOperand(0),
+                                         DAG.getConstant(0, dl, MVT::i64)),
+                             DAG.getConstant(1, dl, MVT::i64));
+
+        // Otherwise just dup from main to all lanes.
+        return DAG.getNode(AArch64ISD::DUP, dl, MVT::v1i64, N);
+      }
+
+      // Neither operand is an extract of higher half, so codegen may just use
+      // the non-high version of PMULL instruction. Use v1i64 to represent i64.
+      assert(N.getValueType() == MVT::i64 &&
+             "Intrinsic aarch64_neon_pmull64 requires i64 parameters");
+      return DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v1i64, N);
+    };
+
+    LHS = TryVectorizeOperand(LHS, LHSLane, RHSLane, dl, DAG);
+    RHS = TryVectorizeOperand(RHS, RHSLane, LHSLane, dl, DAG);
+
+    return DAG.getNode(AArch64ISD::PMULL, dl, Op.getValueType(), LHS, RHS);
   }
   case Intrinsic::aarch64_neon_smax:
     return DAG.getNode(ISD::SMAX, dl, Op.getValueType(),
                        Op.getOperand(1), Op.getOperand(2));
   case Intrinsic::aarch64_neon_umax:
     return DAG.getNode(ISD::UMAX, dl, Op.getValueType(),
                        Op.getOperand(1), Op.getOperand(2));
   case Intrinsic::aarch64_neon_smin:
     return DAG.getNode(ISD::SMIN, dl, Op.getValueType(),
                        Op.getOperand(1), Op.getOperand(2));
   case Intrinsic::aarch64_neon_umin:
     return DAG.getNode(ISD::UMIN, dl, Op.getValueType(),
                        Op.getOperand(1), Op.getOperand(2));

   case Intrinsic::aarch64_sve_sunpkhi:
     return DAG.getNode(AArch64ISD::SUNPKHI, dl, Op.getValueType(),
                        Op.getOperand(1));
   case Intrinsic::aarch64_sve_sunpklo:
     return DAG.getNode(AArch64ISD::SUNPKLO, dl, Op.getValueType(),
                        Op.getOperand(1));
   case Intrinsic::aarch64_sve_uunpkhi:
     return DAG.getNode(AArch64ISD::UUNPKHI, dl, Op.getValueType(),
                        Op.getOperand(1));
   case Intrinsic::aarch64_sve_uunpklo:
     return DAG.getNode(AArch64ISD::UUNPKLO, dl, Op.getValueType(),
                        Op.getOperand(1));
   case Intrinsic::aarch64_sve_clasta_n:
@@ ... @@ return DAG.getNode(ISD::FMINNUM, SDLoc(N), N->getValueType(0),
                        N->getOperand(1), N->getOperand(2));
   case Intrinsic::aarch64_neon_smull:
     return DAG.getNode(AArch64ISD::SMULL, SDLoc(N), N->getValueType(0),
                        N->getOperand(1), N->getOperand(2));
   case Intrinsic::aarch64_neon_umull:
     return DAG.getNode(AArch64ISD::UMULL, SDLoc(N), N->getValueType(0),
                        N->getOperand(1), N->getOperand(2));
   case Intrinsic::aarch64_neon_pmull:
+    return DAG.getNode(AArch64ISD::PMULL, SDLoc(N), N->getValueType(0),
+                       N->getOperand(1), N->getOperand(2));
   case Intrinsic::aarch64_neon_sqdmull:
     return tryCombineLongOpWithDup(IID, N, DCI, DAG);
   case Intrinsic::aarch64_neon_sqshl:
   case Intrinsic::aarch64_neon_uqshl:
   case Intrinsic::aarch64_neon_sqshlu:
   case Intrinsic::aarch64_neon_srshl:
   case Intrinsic::aarch64_neon_urshl:
   case Intrinsic::aarch64_neon_sshl:
   case Intrinsic::aarch64_neon_ushl:
     return tryCombineShiftImm(IID, N, DAG);
@@ ... @@ SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N,
   case ISD::EXTRACT_VECTOR_ELT:
     return performExtractVectorEltCombine(N, DCI, Subtarget);
   case ISD::VECREDUCE_ADD:
     return performVecReduceAddCombine(N, DCI.DAG, Subtarget);
   case AArch64ISD::UADDV:
     return performUADDVCombine(N, DAG);
   case AArch64ISD::SMULL:
   case AArch64ISD::UMULL:
+  case AArch64ISD::PMULL:
     return tryCombineLongOpWithDup(Intrinsic::not_intrinsic, N, DCI, DAG);
   case ISD::INTRINSIC_VOID:
   case ISD::INTRINSIC_W_CHAIN:
     switch (cast<ConstantSDNode>(N->getOperand(1))->getZExtValue()) {
     case Intrinsic::aarch64_sve_prfb_gather_scalar_offset:
       return combineSVEPrefetchVecBaseImmOff(N, DAG, 1 /*=ScalarSizeInBytes*/);
     case Intrinsic::aarch64_sve_prfh_gather_scalar_offset:
       return combineSVEPrefetchVecBaseImmOff(N, DAG, 2 /*=ScalarSizeInBytes*/);

llvm/lib/Target/AArch64/AArch64InstrFormats.td

@@ ... @@
 // Helper fragment for an extract of the high portion of a 128-bit vector. The
 // ComplexPattern match both extract_subvector and bitcast(extract_subvector(..)).
 def extract_high_v16i8 :
    ComplexPattern<v8i8, 1, "SelectExtractHigh", [extract_subvector, bitconvert]>;
 def extract_high_v8i16 :
    ComplexPattern<v4i16, 1, "SelectExtractHigh", [extract_subvector, bitconvert]>;
 def extract_high_v4i32 :
    ComplexPattern<v2i32, 1, "SelectExtractHigh", [extract_subvector, bitconvert]>;
+def extract_high_v2i64 :
+   ComplexPattern<v1i64, 1, "SelectExtractHigh", [extract_subvector, bitconvert]>;

 def extract_high_dup_v8i16 :
    BinOpFrag<(extract_subvector (v8i16 (AArch64duplane16 (v8i16 node:$LHS), node:$RHS)), (i64 4))>;
 def extract_high_dup_v4i32 :
    BinOpFrag<(extract_subvector (v4i32 (AArch64duplane32 (v4i32 node:$LHS), node:$RHS)), (i64 2))>;

 //===----------------------------------------------------------------------===//
 // Asm Operand Classes.
[6,369 lines not shown; context: multiclass SIMDNarrowThreeVectorBHS<bit U, bits<4> opc, string asm,]
def : Pat<(concat_vectors (v2i32 V64:$Rd), (IntOp (v2i64 V128:$Rn),		def : Pat<(concat_vectors (v2i32 V64:$Rd), (IntOp (v2i64 V128:$Rn),
(v2i64 V128:$Rm))),		(v2i64 V128:$Rm))),
(!cast<Instruction>(NAME # "v2i64_v4i32")		(!cast<Instruction>(NAME # "v2i64_v4i32")
(INSERT_SUBREG (IMPLICIT_DEF), V64:$Rd, dsub),		(INSERT_SUBREG (IMPLICIT_DEF), V64:$Rd, dsub),
V128:$Rn, V128:$Rm)>;		V128:$Rn, V128:$Rm)>;
}		}

multiclass SIMDDifferentThreeVectorBD<bit U, bits<4> opc, string asm,		multiclass SIMDDifferentThreeVectorBD<bit U, bits<4> opc, string asm,
Intrinsic IntOp> {		SDPatternOperator OpNode> {
def v8i8 : BaseSIMDDifferentThreeVector<U, 0b000, opc,		def v8i8 : BaseSIMDDifferentThreeVector<U, 0b000, opc,
V128, V64, V64,		V128, V64, V64,
asm, ".8h", ".8b", ".8b",		asm, ".8h", ".8b", ".8b",
[(set (v8i16 V128:$Rd), (IntOp (v8i8 V64:$Rn), (v8i8 V64:$Rm)))]>;		[(set (v8i16 V128:$Rd), (OpNode (v8i8 V64:$Rn), (v8i8 V64:$Rm)))]>;
def v16i8 : BaseSIMDDifferentThreeVector<U, 0b001, opc,		def v16i8 : BaseSIMDDifferentThreeVector<U, 0b001, opc,
V128, V128, V128,		V128, V128, V128,
asm#"2", ".8h", ".16b", ".16b", []>;		asm#"2", ".8h", ".16b", ".16b", []>;
let Predicates = [HasAES] in {		let Predicates = [HasAES] in {
def v1i64 : BaseSIMDDifferentThreeVector<U, 0b110, opc,		def v1i64 : BaseSIMDDifferentThreeVector<U, 0b110, opc,
V128, V64, V64,		V128, V64, V64,
asm, ".1q", ".1d", ".1d", []>;		asm, ".1q", ".1d", ".1d",
		[(set (v16i8 V128:$Rd), (OpNode (v1i64 V64:$Rn), (v1i64 V64:$Rm)))]>;
def v2i64 : BaseSIMDDifferentThreeVector<U, 0b111, opc,		def v2i64 : BaseSIMDDifferentThreeVector<U, 0b111, opc,
V128, V128, V128,		V128, V128, V128,
asm#"2", ".1q", ".2d", ".2d", []>;		asm#"2", ".1q", ".2d", ".2d",
		[(set (v16i8 V128:$Rd), (OpNode (extract_high_v2i64 (v2i64 V128:$Rn)),
		(extract_high_v2i64 (v2i64 V128:$Rm))))]>;
}		}

def : Pat<(v8i16 (IntOp (v8i8 (extract_high_v16i8 (v16i8 V128:$Rn))),		def : Pat<(v8i16 (OpNode (v8i8 (extract_high_v16i8 (v16i8 V128:$Rn))),
(v8i8 (extract_high_v16i8 (v16i8 V128:$Rm))))),		(v8i8 (extract_high_v16i8 (v16i8 V128:$Rm))))),
(!cast<Instruction>(NAME#"v16i8") V128:$Rn, V128:$Rm)>;		(!cast<Instruction>(NAME#"v16i8") V128:$Rn, V128:$Rm)>;
}		}

multiclass SIMDLongThreeVectorHS<bit U, bits<4> opc, string asm,		multiclass SIMDLongThreeVectorHS<bit U, bits<4> opc, string asm,
SDPatternOperator OpNode> {		SDPatternOperator OpNode> {
def v4i16_v4i32 : BaseSIMDDifferentThreeVector<U, 0b010, opc,		def v4i16_v4i32 : BaseSIMDDifferentThreeVector<U, 0b010, opc,
V128, V64, V64,		V128, V64, V64,

llvm/lib/Target/AArch64/AArch64InstrInfo.td

def AArch64WrapperLarge : SDNode<"AArch64ISD::WrapperLarge",		def AArch64WrapperLarge : SDNode<"AArch64ISD::WrapperLarge",
SDT_AArch64WrapperLarge>;		SDT_AArch64WrapperLarge>;

def AArch64NvCast : SDNode<"AArch64ISD::NVCAST", SDTUnaryOp>;		def AArch64NvCast : SDNode<"AArch64ISD::NVCAST", SDTUnaryOp>;

def SDT_AArch64mull : SDTypeProfile<1, 2, [SDTCisInt<0>, SDTCisInt<1>,		def SDT_AArch64mull : SDTypeProfile<1, 2, [SDTCisInt<0>, SDTCisInt<1>,
SDTCisSameAs<1, 2>]>;		SDTCisSameAs<1, 2>]>;
		def AArch64pmull : SDNode<"AArch64ISD::PMULL", SDT_AArch64mull,
		dmgreen: Add a newline like the others below to keep the line length down. I don't think there is a strict line length in these files, but we try to keep the lines from getting too long.
		[SDNPCommutative]>;
def AArch64smull : SDNode<"AArch64ISD::SMULL", SDT_AArch64mull,		def AArch64smull : SDNode<"AArch64ISD::SMULL", SDT_AArch64mull,
[SDNPCommutative]>;		[SDNPCommutative]>;
def AArch64umull : SDNode<"AArch64ISD::UMULL", SDT_AArch64mull,		def AArch64umull : SDNode<"AArch64ISD::UMULL", SDT_AArch64mull,
[SDNPCommutative]>;		[SDNPCommutative]>;

def AArch64frecpe : SDNode<"AArch64ISD::FRECPE", SDTFPUnaryOp>;		def AArch64frecpe : SDNode<"AArch64ISD::FRECPE", SDTFPUnaryOp>;
def AArch64frecps : SDNode<"AArch64ISD::FRECPS", SDTFPBinOp>;		def AArch64frecps : SDNode<"AArch64ISD::FRECPS", SDTFPBinOp>;
def AArch64frsqrte : SDNode<"AArch64ISD::FRSQRTE", SDTFPUnaryOp>;		def AArch64frsqrte : SDNode<"AArch64ISD::FRSQRTE", SDTFPUnaryOp>;
[4,539 lines not shown]
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Advanced SIMD three different-sized vector instructions.		// Advanced SIMD three different-sized vector instructions.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

defm ADDHN : SIMDNarrowThreeVectorBHS<0,0b0100,"addhn", int_aarch64_neon_addhn>;		defm ADDHN : SIMDNarrowThreeVectorBHS<0,0b0100,"addhn", int_aarch64_neon_addhn>;
defm SUBHN : SIMDNarrowThreeVectorBHS<0,0b0110,"subhn", int_aarch64_neon_subhn>;		defm SUBHN : SIMDNarrowThreeVectorBHS<0,0b0110,"subhn", int_aarch64_neon_subhn>;
defm RADDHN : SIMDNarrowThreeVectorBHS<1,0b0100,"raddhn",int_aarch64_neon_raddhn>;		defm RADDHN : SIMDNarrowThreeVectorBHS<1,0b0100,"raddhn",int_aarch64_neon_raddhn>;
defm RSUBHN : SIMDNarrowThreeVectorBHS<1,0b0110,"rsubhn",int_aarch64_neon_rsubhn>;		defm RSUBHN : SIMDNarrowThreeVectorBHS<1,0b0110,"rsubhn",int_aarch64_neon_rsubhn>;
defm PMULL : SIMDDifferentThreeVectorBD<0,0b1110,"pmull",int_aarch64_neon_pmull>;		defm PMULL : SIMDDifferentThreeVectorBD<0,0b1110,"pmull", AArch64pmull>;
defm SABAL : SIMDLongThreeVectorTiedBHSabal<0,0b0101,"sabal",		defm SABAL : SIMDLongThreeVectorTiedBHSabal<0,0b0101,"sabal",
AArch64sabd>;		AArch64sabd>;
defm SABDL : SIMDLongThreeVectorBHSabdl<0, 0b0111, "sabdl",		defm SABDL : SIMDLongThreeVectorBHSabdl<0, 0b0111, "sabdl",
AArch64sabd>;		AArch64sabd>;
defm SADDL : SIMDLongThreeVectorBHS< 0, 0b0000, "saddl",		defm SADDL : SIMDLongThreeVectorBHS< 0, 0b0000, "saddl",
BinOpFrag<(add (sext node:$LHS), (sext node:$RHS))>>;		BinOpFrag<(add (sext node:$LHS), (sext node:$RHS))>>;
defm SADDW : SIMDWideThreeVectorBHS< 0, 0b0001, "saddw",		defm SADDW : SIMDWideThreeVectorBHS< 0, 0b0001, "saddw",
BinOpFrag<(add node:$LHS, (sext node:$RHS))>>;		BinOpFrag<(add node:$LHS, (sext node:$RHS))>>;
[61 lines not shown]
defm : Neon_mul_acc_widen_patterns<add, AArch64umull,		defm : Neon_mul_acc_widen_patterns<add, AArch64umull,
UMLALv8i8_v8i16, UMLALv4i16_v4i32, UMLALv2i32_v2i64>;		UMLALv8i8_v8i16, UMLALv4i16_v4i32, UMLALv2i32_v2i64>;
defm : Neon_mul_acc_widen_patterns<add, AArch64smull,		defm : Neon_mul_acc_widen_patterns<add, AArch64smull,
SMLALv8i8_v8i16, SMLALv4i16_v4i32, SMLALv2i32_v2i64>;		SMLALv8i8_v8i16, SMLALv4i16_v4i32, SMLALv2i32_v2i64>;
defm : Neon_mul_acc_widen_patterns<sub, AArch64umull,		defm : Neon_mul_acc_widen_patterns<sub, AArch64umull,
UMLSLv8i8_v8i16, UMLSLv4i16_v4i32, UMLSLv2i32_v2i64>;		UMLSLv8i8_v8i16, UMLSLv4i16_v4i32, UMLSLv2i32_v2i64>;
defm : Neon_mul_acc_widen_patterns<sub, AArch64smull,		defm : Neon_mul_acc_widen_patterns<sub, AArch64smull,
SMLSLv8i8_v8i16, SMLSLv4i16_v4i32, SMLSLv2i32_v2i64>;		SMLSLv8i8_v8i16, SMLSLv4i16_v4i32, SMLSLv2i32_v2i64>;

// Patterns for 64-bit pmull
def : Pat<(int_aarch64_neon_pmull64 V64:$Rn, V64:$Rm),
(PMULLv1i64 V64:$Rn, V64:$Rm)>;
def : Pat<(int_aarch64_neon_pmull64 (extractelt (v2i64 V128:$Rn), (i64 1)),
(extractelt (v2i64 V128:$Rm), (i64 1))),
(PMULLv2i64 V128:$Rn, V128:$Rm)>;

// CodeGen patterns for addhn and subhn instructions, which can actually be		// CodeGen patterns for addhn and subhn instructions, which can actually be
// written in LLVM IR without too much difficulty.		// written in LLVM IR without too much difficulty.

// Prioritize ADDHN and SUBHN over UZP2.		// Prioritize ADDHN and SUBHN over UZP2.
let AddedComplexity = 10 in {		let AddedComplexity = 10 in {

// ADDHN		// ADDHN
def : Pat<(v8i8 (trunc (v8i16 (AArch64vlshr (add V128:$Rn, V128:$Rm), (i32 8))))),		def : Pat<(v8i8 (trunc (v8i16 (AArch64vlshr (add V128:$Rn, V128:$Rm), (i32 8))))),

llvm/test/CodeGen/AArch64/aarch64-pmull2.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -verify-machineinstrs -mtriple=aarch64-linux-gnu -mattr=+aes -o - %s| FileCheck %s --check-prefixes=CHECK		; RUN: llc -verify-machineinstrs -mtriple=aarch64-linux-gnu -mattr=+aes -o - %s| FileCheck %s --check-prefixes=CHECK

; User code intends to execute {pmull, pmull2} instructions on {lower, higher} half of the same vector registers directly.		; User code intends to execute {pmull, pmull2} instructions on {lower, higher} half of the same vector registers directly.
; Test that PMULL2 are generated for higher-half operands.		; Test that PMULL2 are generated for higher-half operands.
; The suboptimal code generation fails to use higher-half contents in place; instead, it moves higher-lane contents to lower lane		; The suboptimal code generation fails to use higher-half contents in place; instead, it moves higher-lane contents to lower lane
; to make use of PMULL everywhere, and generates unnecessary moves.		; to make use of PMULL everywhere, and generates unnecessary moves.
define void @test1(ptr %0, ptr %1) {		define void @test1(ptr %0, ptr %1) {
; CHECK-LABEL: test1:		; CHECK-LABEL: test1:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: ldp q0, q1, [x1]
; CHECK-NEXT: mov w8, #56824
; CHECK-NEXT: mov w9, #61186		; CHECK-NEXT: mov w9, #61186
; CHECK-NEXT: movk w8, #40522, lsl #16		; CHECK-NEXT: mov w8, #56824
; CHECK-NEXT: movk w9, #29710, lsl #16		; CHECK-NEXT: movk w9, #29710, lsl #16
; CHECK-NEXT: mov x10, v0.d[1]		; CHECK-NEXT: movk w8, #40522, lsl #16
; CHECK-NEXT: fmov d2, x9		; CHECK-NEXT: ldp q0, q1, [x1]
; CHECK-NEXT: mov x11, v1.d[1]		; CHECK-NEXT: fmov d3, x9
; CHECK-NEXT: fmov d3, x8		; CHECK-NEXT: dup v2.2d, x8
; CHECK-NEXT: fmov d4, x10		; CHECK-NEXT: pmull2 v4.1q, v0.2d, v2.2d
; CHECK-NEXT: pmull v0.1q, v0.1d, v2.1d		; CHECK-NEXT: pmull v0.1q, v0.1d, v3.1d
; CHECK-NEXT: fmov d5, x11		; CHECK-NEXT: pmull2 v2.1q, v1.2d, v2.2d
; CHECK-NEXT: pmull v1.1q, v1.1d, v2.1d		; CHECK-NEXT: pmull v1.1q, v1.1d, v3.1d
; CHECK-NEXT: pmull v2.1q, v4.1d, v3.1d		; CHECK-NEXT: eor v0.16b, v0.16b, v4.16b
; CHECK-NEXT: pmull v3.1q, v5.1d, v3.1d		; CHECK-NEXT: eor v1.16b, v1.16b, v2.16b
; CHECK-NEXT: eor v0.16b, v0.16b, v2.16b
; CHECK-NEXT: eor v1.16b, v1.16b, v3.16b
; CHECK-NEXT: stp q0, q1, [x1]		; CHECK-NEXT: stp q0, q1, [x1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%3 = load <2 x i64>, ptr %1		%3 = load <2 x i64>, ptr %1
%4 = getelementptr inbounds <2 x i64>, ptr %1, i64 1		%4 = getelementptr inbounds <2 x i64>, ptr %1, i64 1
%5 = load <2 x i64>, ptr %4		%5 = load <2 x i64>, ptr %4
%6 = extractelement <2 x i64> %3, i64 1		%6 = extractelement <2 x i64> %3, i64 1
%7 = tail call <16 x i8> @llvm.aarch64.neon.pmull64(i64 %6, i64 2655706616)		%7 = tail call <16 x i8> @llvm.aarch64.neon.pmull64(i64 %6, i64 2655706616)
%8 = extractelement <2 x i64> %5, i64 1		%8 = extractelement <2 x i64> %5, i64 1
[12 lines not shown; context: ; CHECK-NEXT: ret]
ret void		ret void
}		}

; One operand is higher-half of SIMD register, and the other operand is lower-half of another SIMD register.		; One operand is higher-half of SIMD register, and the other operand is lower-half of another SIMD register.
; Tests that codegen doesn't generate unnecessary moves.		; Tests that codegen doesn't generate unnecessary moves.
define void @test2(ptr %0, <2 x i64> %1, <2 x i64> %2) {		define void @test2(ptr %0, <2 x i64> %1, <2 x i64> %2) {
; CHECK-LABEL: test2:		; CHECK-LABEL: test2:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: mov x8, v0.d[1]		; CHECK-NEXT: dup v1.2d, v1.d[0]
; CHECK-NEXT: fmov d0, x8		; CHECK-NEXT: pmull2 v0.1q, v0.2d, v1.2d
; CHECK-NEXT: pmull v0.1q, v0.1d, v1.1d
; CHECK-NEXT: str q0, [x0]		; CHECK-NEXT: str q0, [x0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%4 = extractelement <2 x i64> %1, i64 1		%4 = extractelement <2 x i64> %1, i64 1
%5 = extractelement <2 x i64> %2, i64 0		%5 = extractelement <2 x i64> %2, i64 0
%6 = tail call <16 x i8> @llvm.aarch64.neon.pmull64(i64 %4, i64 %5)		%6 = tail call <16 x i8> @llvm.aarch64.neon.pmull64(i64 %4, i64 %5)
store <16 x i8> %6, ptr %0, align 16		store <16 x i8> %6, ptr %0, align 16
ret void		ret void
}		}

declare <16 x i8> @llvm.aarch64.neon.pmull64(i64, i64)		declare <16 x i8> @llvm.aarch64.neon.pmull64(i64, i64)
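
The test comments above contrast using the higher half "in place" (pmull2) with first moving lane 1 into lane 0 and then using pmull. As a rough illustration only, the following C++ sketch models both shapes in software and checks that they agree. It is not part of the patch; every name in it (`U128`, `V2i64`, `clmul64`, `pmull`, `pmull2`, `moveHighToLow`, `sameResult`) is hypothetical, and the model assumes only that PMULL on 64-bit lanes is a carry-less (polynomial over GF(2)) multiply producing a 128-bit result.

```cpp
#include <array>
#include <cstdint>

// Two 64-bit lanes of a 128-bit SIMD register (lane 0 = lower half).
using V2i64 = std::array<std::uint64_t, 2>;

// 128-bit product, split into low and high 64-bit halves.
struct U128 {
  std::uint64_t lo;
  std::uint64_t hi;
};

// Carry-less multiply of two 64-bit polynomials over GF(2):
// partial products are combined with XOR instead of addition.
U128 clmul64(std::uint64_t a, std::uint64_t b) {
  U128 r{0, 0};
  for (int i = 0; i < 64; ++i) {
    if ((b >> i) & 1) {
      r.lo ^= a << i;
      if (i != 0)
        r.hi ^= a >> (64 - i); // bits of a shifted past bit 63
    }
  }
  return r;
}

// Model of "pmull Vd.1q, Vn.1d, Vm.1d": multiplies the lower lanes.
U128 pmull(const V2i64 &n, const V2i64 &m) { return clmul64(n[0], m[0]); }

// Model of "pmull2 Vd.1q, Vn.2d, Vm.2d": multiplies the higher lanes in place.
U128 pmull2(const V2i64 &n, const V2i64 &m) { return clmul64(n[1], m[1]); }

// Model of the extra moves in the old output (mov x, v.d[1]; fmov d, x):
// lane 1 is copied into lane 0 of a fresh register, upper lane zeroed.
V2i64 moveHighToLow(const V2i64 &v) { return {v[1], 0}; }

// True when "move then pmull" and "pmull2 in place" give the same product.
bool sameResult(const V2i64 &n, const V2i64 &m) {
  U128 viaMove = pmull(moveHighToLow(n), moveHighToLow(m));
  U128 inPlace = pmull2(n, m);
  return viaMove.lo == inPlace.lo && viaMove.hi == inPlace.hi;
}
```

In this model the two forms differ only in the register moves that precede the multiply, which is the overhead the updated CHECK lines no longer contain.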

llvm/test/CodeGen/AArch64/pmull-ldr-merge.ll

[22 lines not shown; context: ; CHECK-NEXT: ret]
ret void		ret void
}		}

; Operand %8 is higher-half of v2i64, and operand %7 is a scalar load.		; Operand %8 is higher-half of v2i64, and operand %7 is a scalar load.
; Tests that operand is loaded into SIMD registers directly as opposed to being loaded into GPR followed by a fmov.		; Tests that operand is loaded into SIMD registers directly as opposed to being loaded into GPR followed by a fmov.
define void @test2(ptr %0, i64 %1, i64 %2, <2 x i64> %3) {		define void @test2(ptr %0, i64 %1, i64 %2, <2 x i64> %3) {
; CHECK-LABEL: test2:		; CHECK-LABEL: test2:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: mov x9, v0.d[1]
; CHECK-NEXT: add x8, x0, x1, lsl #4		; CHECK-NEXT: add x8, x0, x1, lsl #4
; CHECK-NEXT: ldr d0, [x8, #8]		; CHECK-NEXT: add x9, x8, #8
; CHECK-NEXT: fmov d1, x9		; CHECK-NEXT: ld1r { v1.2d }, [x9]
; CHECK-NEXT: pmull v0.1q, v1.1d, v0.1d		; CHECK-NEXT: pmull2 v0.1q, v0.2d, v1.2d
; CHECK-NEXT: str q0, [x8]		; CHECK-NEXT: str q0, [x8]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%5 = getelementptr inbounds <2 x i64>, ptr %0, i64 %1		%5 = getelementptr inbounds <2 x i64>, ptr %0, i64 %1
%6 = getelementptr inbounds <2 x i64>, ptr %0, i64 %1, i64 1		%6 = getelementptr inbounds <2 x i64>, ptr %0, i64 %1, i64 1
%7 = load i64, ptr %6, align 8		%7 = load i64, ptr %6, align 8
%8 = extractelement <2 x i64> %3, i64 1		%8 = extractelement <2 x i64> %3, i64 1
%9 = tail call <16 x i8> @llvm.aarch64.neon.pmull64(i64 %8, i64 %7)		%9 = tail call <16 x i8> @llvm.aarch64.neon.pmull64(i64 %8, i64 %7)
store <16 x i8> %9, ptr %5, align 16		store <16 x i8> %9, ptr %5, align 16
[19 lines not shown; context: ; CHECK-NEXT: ret]
ret void		ret void
}		}

; Operand %4 is the higher-half of v2i64, and operand %2 is an input parameter of i64.		; Operand %4 is the higher-half of v2i64, and operand %2 is an input parameter of i64.
; Test that %2 is duplicated into the proper lane of SIMD directly for optimal codegen.		; Test that %2 is duplicated into the proper lane of SIMD directly for optimal codegen.
define void @test4(ptr %0, <2 x i64> %1, i64 %2) {		define void @test4(ptr %0, <2 x i64> %1, i64 %2) {
; CHECK-LABEL: test4:		; CHECK-LABEL: test4:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: mov x8, v0.d[1]		; CHECK-NEXT: dup v1.2d, x1
; CHECK-NEXT: fmov d0, x1		; CHECK-NEXT: pmull2 v0.1q, v0.2d, v1.2d
; CHECK-NEXT: fmov d1, x8
; CHECK-NEXT: pmull v0.1q, v1.1d, v0.1d
; CHECK-NEXT: str q0, [x0]		; CHECK-NEXT: str q0, [x0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%4 = extractelement <2 x i64> %1, i64 1		%4 = extractelement <2 x i64> %1, i64 1
%5 = tail call <16 x i8> @llvm.aarch64.neon.pmull64(i64 %4, i64 %2)		%5 = tail call <16 x i8> @llvm.aarch64.neon.pmull64(i64 %4, i64 %2)
store <16 x i8> %5, ptr %0, align 16		store <16 x i8> %5, ptr %0, align 16
ret void		ret void
}		}

declare <16 x i8> @llvm.aarch64.neon.pmull64(i64, i64)		declare <16 x i8> @llvm.aarch64.neon.pmull64(i64, i64)
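
The test4 expectation (dup of the scalar followed by pmull2) relies on the scalar operand ending up in lane 1 alongside the vector's higher half. A minimal C++ sketch of why that shape computes the same value as the IR is given below. It is a software model under the assumption that PMULL on 64-bit lanes is a carry-less multiply; all names (`clmul64`, `dup2d`, `pmull2`, `fromIR`, `viaDupAndPmull2`) are hypothetical and do not appear in the patch.

```cpp
#include <array>
#include <cstdint>

using V2i64 = std::array<std::uint64_t, 2>;

struct U128 {
  std::uint64_t lo;
  std::uint64_t hi;
};

// Carry-less 64x64 -> 128-bit multiply (XOR of shifted partial products).
U128 clmul64(std::uint64_t a, std::uint64_t b) {
  U128 r{0, 0};
  for (int i = 0; i < 64; ++i) {
    if ((b >> i) & 1) {
      r.lo ^= a << i;
      if (i != 0)
        r.hi ^= a >> (64 - i);
    }
  }
  return r;
}

// Model of "dup Vd.2d, Xn": the scalar is replicated into both lanes.
V2i64 dup2d(std::uint64_t x) { return {x, x}; }

// Model of "pmull2 Vd.1q, Vn.2d, Vm.2d": multiplies lane 1 of each input.
U128 pmull2(const V2i64 &n, const V2i64 &m) { return clmul64(n[1], m[1]); }

// What the IR asks for: pmull64(extractelement(v, 1), s).
U128 fromIR(const V2i64 &v, std::uint64_t s) { return clmul64(v[1], s); }

// Shape of the new output: replicate s, then multiply the higher lanes.
U128 viaDupAndPmull2(const V2i64 &v, std::uint64_t s) {
  return pmull2(v, dup2d(s));
}
```

Because the replicated scalar is present in lane 1, the higher-half operand never has to leave its register in this model.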