Page MenuHomePhabricator

[AArch64] Change aarch64_neon_pmull{,64} intrinsic ISel through a new SDNode.
ClosedPublic

Authored by mingmingl on Aug 2 2022, 11:25 PM.

Details

Summary

How:

  1. Add AArch64ISD::PMULL SDNode, and extend aarch64_neon_pmull intrinsic tablegen pattern for this SDNode.
  2. For aarch64_neon_pmull64, canonicalize i64 operands to v1i64 vectors during legalization.
  3. For {aarch64_neon_pmull, aarch64_neon_pmull64}, combine intrinsic to SDNode. in dag-combiner

Why

  1. aarch64_neon_pmull64 is the motivating use case. Adding the SDNode makes it easier to canonicalize i64 inputs to vector inputs. Vector inputs carries lane information, which helps dag-combiner to combine nodes (e.g. rewrite to a better node to prepare for instruction selection); as a result, instruction-selection could emit instructions that use higher-half inputs in place (i.e., no need to move lane 1 content to lane 0).
  2. For aarch64_neon_pmull, using the SDNode is NFC, yet without this we have to move the definition of {PMULLv1i64, PMULLv2i64} out of its current group of records without gains.

Diff Detail

Event Timeline

mingmingl created this revision.Aug 2 2022, 11:25 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 2 2022, 11:25 PM
mingmingl requested review of this revision.Aug 2 2022, 11:25 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 2 2022, 11:25 PM
mingmingl edited the summary of this revision. (Show Details)Aug 2 2022, 11:27 PM
mingmingl added reviewers: efriedma, dmgreen.

The problem with tablegen patterns that produce multiple values is that they don't allow the other nodes to combine as they naturally should. For example in this case a dup(load could be a ld1r. And I'm pretty sure a dup v1, x1 should be better than a 'fmov;dup' pair.

There is some code that already handles dup of other mull instructions, including pmull in tryCombineLongOpWithDup. It doesn't look like it gets called for a pmull64 though, could it be used here?

mingmingl planned changes to this revision.Aug 3 2022, 10:53 PM

The problem with tablegen patterns that produce multiple values is that they don't allow the other nodes to combine as they naturally should. For example in this case a dup(load could be a ld1r. And I'm pretty sure a dup v1, x1 should be better than a 'fmov;dup' pair.

There is some code that already handles dup of other mull instructions, including pmull in tryCombineLongOpWithDup. It doesn't look like it gets called for a pmull64 though, could it be used here?

Thanks for pointing out tryCombineLongOpWithDup (wasn't aware of this approach before); it looks reasonable to me to follow the dag-combiner path. Plan changes.

mingmingl edited the summary of this revision. (Show Details)

Updated the tablegen pattern to generate DUPv2i64gpr directly, with the motivation in inline replies.

mingmingl added a comment.EditedAug 14 2022, 12:04 AM

The problem with tablegen patterns that produce multiple values is that they don't allow the other nodes to combine as they naturally should. For example in this case a dup(load could be a ld1r. And I'm pretty sure a dup v1, x1 should be better than a 'fmov;dup' pair.

There is some code that already handles dup of other mull instructions, including pmull in tryCombineLongOpWithDup. It doesn't look like it gets called for a pmull64 though, could it be used here?

Thanks for pointing out tryCombineLongOpWithDup (wasn't aware of this approach before); it looks reasonable to me to follow the dag-combiner path. Plan changes.

After a study (details elaborated below), update the tablegen pattern to use DUPv2i64gpr to dup from GPR directly.

Some contexts on why updating tablegen pattern:

  1. Ultimately, we want dup V128, GPR64 as opposed to fmov V64 GPR (or dup V64 GPR) followed by a duplane64 V128, 0; so tablegen pattern use DUPv2i64gpr (dup from main to SIMD register) to express the intention.
  2. Alternatively, this two-step approach works
    • First step is to transform GPR64 to extractelt (duplane64 (scalar_to_vector V128, GPR64), 0), 1 (in AArch64ISelLowering.cpp) so pmull2 is used, and second step is to transform (duplane64 (scalar_to_vector V128, GPR64), 0) to dup V128, GPR (in AArch64InstrInfo.td)
    • https://reviews.llvm.org/differential/diff/452482/ shows two-step approach, with {cpp, td, test} changes. Note the tests are updated in the same way as this patch.
  3. Both this patch and the alternative option in 2, retain duplane64 throughout the legalize/combine steps, until instruction selection really begins. This is because extactelt(dup x) will be combined into x (code where this happens)
  4. tryCombineLongOpWithDup is suitable when operands are vector, but intrinsic llvm.aarch64.neon.pmull64 has scalar operands.
    • Vectorizing a scalar operand during dag-combiner is not preferred here, since new operands of a combined node won't be appended to worklist for further combination (only the combined node and its users will be added to the worklist).
    • Even if new operands could be added to worklist for further combination opportunities, we want to retain duplane64 (rather than combining it into dup) as mentioned in 3.

(Sorry for the delay; I did some study for the findings and then took a one-week vacation)

mingmingl retitled this revision from [AArch64] Add a tablegen pattern for aarch64.neon.pmull64 to [AArch64] Add a tablegen pattern to transform duplane(scalar_to_vector(x),0) to dup(x), and vectorize scalar operands for aarch64.neon.pmull64 intrinsic.
mingmingl edited the summary of this revision. (Show Details)

Decide to go with a tablegen pattern for dup, and vectorization for aarch64.neon.pmumll64 intrinisc.

  • The alternative is a tablegen pattern for instrinsic [1]. This alternative is sub-optimal since it create new nodes (to dup from GPR) in the final instruction stage, and missing chances of combination in dag-combiners.
    • For example, if i64 is the lower-half of SIMD registers, we want to dup from lane, directly rather than generating a fmov (from SIMD lane 0 to GPR) followed by a dup (from GPR to all lanes of SIMD). test2 in aarch64-pmull2 is the corresponding test for this.

[1]

def : Pat<(int_aarch64_neon_pmull64 (extractelt (v2i64 V128:$Rn), (i64 1)),
                                    GPR64:$Rm),
          (PMULLv2i64 V128:$Rn, (v2i64 (DUPv2i64gpr GPR64:$Rm)))>;

Can you upload with full context?

If this is relying on DUPLANE(SCALAR_TO_VEC) not simplifying into DUP, that might not always be true as more optimizations are added in the future. It may be better to canonicalize all PMULL64 to use v1i64 vectors, and use EXTRACT_SUBVECTOR from vectors. It may require adding a new PMULL64 node if the intrinsics require i64 inputs, but it should allow ld1r to be created as opposed to load+dup.

mingmingl planned changes to this revision.Aug 15 2022, 10:02 PM

Can you upload with full context?

Ah, I should have made it more explicit that the context above was all my findings.

If this is relying on DUPLANE(SCALAR_TO_VEC) not simplifying into DUP, that might not always be true as more optimizations are added in the future.

I didn't realize the change relies on the miss of duplane(scalar_to_vector)->dup. Thanks for the catch!

It may be better to canonicalize all PMULL64 to use v1i64 vectors, and use EXTRACT_SUBVECTOR from vectors. It may require adding a new PMULL64 node if the intrinsics require i64 inputs, but it should allow ld1r to be created as opposed to load+dup.

Re "canonicalize all PMULL64 to use v1i64 and adding a PMULL64 node" does this mean adding a aarch64-specific SelectionDAG node for PMULL64 (llvm doc), like what SMULL/UMULL does? This sounds promising and I will make a change. Let me know if I misunderstand anything, thanks for all the guidance!

Re aarch64_neon_pmull64 relies on i64 inputs, yeah this means [1] won't compile, and [2] will require a v1i64 -> i64 change, which is not desired (details in [3])

[1]

// with "Type set is empty for each HW mode" error from tablegen 
def : Pat<(int_aarch64_neon_pmull64 (v1i64 (extract_subvector V128:$Rn, (i64 1))),
                                    (v1i64 (extract_subvector V128:$Rm, (i64 1)))),
                (PMULLv2i64 V128:$Rn, V128:$Rm)>

[2]

def : Pat<(int_aarch64_neon_pmull64 (i64 (bitconvert (v1i64 (extract_subvector V128:$Rn, (i64 1))))),
                                    (i64 (bitconvert (v1i64 (extract_subvector V128:$Rm, (i64 1)))))),
          (PMULLv2i64 V128:$Rn, V128:$Rm)>;

[3]
First, v1i64->i64 reverses i64->v1i64, so having both of them either gives indefinite "combine->legalization->combine" loop ( nodes created by dag-combiner will be re-legalized ) , or makes the steps super stateful and complex.
Secondly, PreprocessISelDAG, the per-target hook right before ISel begins, isn't implemented in aarch64 backend. And implementing it (for this use case) comes with a non-eligible cost to iterate all instructions.

Can you upload with full context?

Ah, I should have made it more explicit that the context above was all my findings.

I meant update the patch with -U9999999 as per https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface. It makes the reviews easier to read.

If this is relying on DUPLANE(SCALAR_TO_VEC) not simplifying into DUP, that might not always be true as more optimizations are added in the future.

I didn't realize the change relies on the miss of duplane(scalar_to_vector)->dup. Thanks for the catch!

It may be better to canonicalize all PMULL64 to use v1i64 vectors, and use EXTRACT_SUBVECTOR from vectors. It may require adding a new PMULL64 node if the intrinsics require i64 inputs, but it should allow ld1r to be created as opposed to load+dup.

Re "canonicalize all PMULL64 to use v1i64 and adding a PMULL64 node" does this mean adding a aarch64-specific SelectionDAG node for PMULL64 (llvm doc), like what SMULL/UMULL does? This sounds promising and I will make a change. Let me know if I misunderstand anything, thanks for all the guidance!

Yeah, a AArch64ISD::PMULL64 node. It can hopefully make the intent that the inputs are i64's in vector registers easier to specify.

mingmingl retitled this revision from [AArch64] Add a tablegen pattern to transform duplane(scalar_to_vector(x),0) to dup(x), and vectorize scalar operands for aarch64.neon.pmull64 intrinsic to [AArch64] Change aarch64_neon_pmull{,64} intrinsic ISel through a new SDNode..
mingmingl edited the summary of this revision. (Show Details)

Add SDNode pmull and use it for both aarch64_neon_pmull and aarch64_neon_pmull64.

Can you upload with full context?

Ah, I should have made it more explicit that the context above was all my findings.

I meant update the patch with -U9999999 as per https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface. It makes the reviews easier to read.

Got it. Used "git show HEAD -U999999 > mypatch.patch" this time.

If this is relying on DUPLANE(SCALAR_TO_VEC) not simplifying into DUP, that might not always be true as more optimizations are added in the future.

I didn't realize the change relies on the miss of duplane(scalar_to_vector)->dup. Thanks for the catch!

It may be better to canonicalize all PMULL64 to use v1i64 vectors, and use EXTRACT_SUBVECTOR from vectors. It may require adding a new PMULL64 node if the intrinsics require i64 inputs, but it should allow ld1r to be created as opposed to load+dup.

Re "canonicalize all PMULL64 to use v1i64 and adding a PMULL64 node" does this mean adding a aarch64-specific SelectionDAG node for PMULL64 (llvm doc), like what SMULL/UMULL does? This sounds promising and I will make a change. Let me know if I misunderstand anything, thanks for all the guidance!

Yeah, a AArch64ISD::PMULL64 node. It can hopefully make the intent that the inputs are i64's in vector registers easier to specify.

Done with two things to mention

  1. While only one operand is a higher-half, the other operand is legalized with DUP.
    • If the other operand is a lower-half, DUPLANE64 is better. Use a FIXME assuming aarch64_neon_pmull64(higher-half, lower-half) is not common. This is feasible to fix (using Optional<uint64_t> to represent lane1, lane0 or not SIMD) but just add a little more code complexity.
  2. It seems to me that ld1r requires the base address to be a GPR so address with offset becomes another add instruction; while with ldr (small) offsets could be folded into instruction itself. So use ISD::SCALAR_TO_VECTOR for i64->v1i64 rather than ISD::AArch64Dup.

Let me know if I miss anything in 1) or 2). Thanks!

mingmingl updated this revision to Diff 453229.Aug 17 2022, 1:17 AM

Update after precommit tests are simplified so that diff is clearer.

dmgreen added inline comments.Aug 17 2022, 9:47 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4614–4616

Return a AArch64ISD::PMULL here...

4634

Will the type ever not be a i64?

16696–16697

.. then I don't think this is needed, we have already "lowered" pmull64.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
674

Add a newline like the others below to keep the line-length down. I don't think there is a strict line length in these files, but we try to keep the lines getting too long.

mingmingl marked 4 inline comments as done.
mingmingl edited the summary of this revision. (Show Details)

Update include:

  1. address comments
  2. took the liberty to detect both lane 1 and lane 0 (previously only lane 1) so duplane64 (as opposed to dup-from-gpr) is used for lane 0, as a small optimization.

Also at this point, test4 in pmull-ldr-merge.ll is not about ldr, so aarch64-pmull.ll is a better place. I could send an NFC change later , so as not to frequently updating the precommit test patch ( an NFC update to precommit test is not obvious in the UI)

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4634

By lowering aarch64_neon_pmull64 to AArch64ISD::PMULL around line 4583 below (as suggested), the type would always be i64. So added an assert -> without intrinsic->AArch64ISD::PMULL lowering, the new intrinsic SDNode will be added to node list of the DAG (code) for legalization; by then the type is not i64.

mingmingl updated this revision to Diff 453798.Aug 18 2022, 2:52 PM

Changes

  1. Fix a subtle C++ bug in static lambda TryVectorizeOperand-> the helper function is declared as lambda to limit scope (no need to sanity check parameters), but it should really not capture variables (that could change per invocation). This issue just occurred to me when looking at the codebase.
  2. In tablegen pattern SIMDDifferentThreeVectorBD, remove default parameter (null) for OpNode since there isn't a use case for default parameter now.
dmgreen accepted this revision.Aug 19 2022, 2:06 AM

Thanks. LGTM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4576

Remove static from this variable. I'm pretty sure it only really applies to the TryVectorizeOperand variable, not anything to do with the lambda, so needn't be static.

This revision is now accepted and ready to land.Aug 19 2022, 2:06 AM
mingmingl marked an inline comment as done.

Resolve comments.

Thanks. LGTM

Thanks for consistently steering the patch in the right direction!

I learnt I should have asked for review of D131045 in the first place when it came to commit time. It's better late than never, so did that just now.

This revision was landed with ongoing or failed builds.Aug 19 2022, 1:18 PM
This revision was automatically updated to reflect the committed changes.