# [AArch64][SVE] Codegen dup_lane for dup(vector_extract)ClosedPublicActions

Authored by junparser on Mar 25 2021, 2:13 AM.

# Details

Reviewers
 david-arm sdesmalen peterwaller-arm paulwalker-arm efriedma
Commits
rG1af373c67369: [AArch64][SVE] Codegen dup_lane for dup(vector_extract)
Summary

As is title, this patch adds the pattern dup(vector_extract(vec, idx)) which matches dup_lane instruction: DUP <Zd>.<T>, <Zn>.<T>[<imm>]

TestPlan: check-llvm

# Unit TestsFailedView All

TimeTest
70 msx64 debian > LLVM.Transforms/SimpleLoopUnswitch::partial-unswitch.ll
Script: -- : 'RUN: at line 2'; /mnt/disks/ssd0/agent/llvm-project/build/bin/opt -passes='loop-mssa(unswitch<nontrivial>),verify<loops>' -S < /mnt/disks/ssd0/agent/llvm-project/llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch.ll | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/llvm/test/Transforms/SimpleLoopUnswitch/partial-unswitch.ll
90 msx64 windows > LLVM.Transforms/SimpleLoopUnswitch::partial-unswitch.ll
Script: -- : 'RUN: at line 2'; c:\ws\w16-1\llvm-project\premerge-checks\build\bin\opt.exe -passes='loop-mssa(unswitch<nontrivial>),verify<loops>' -S < C:\ws\w16-1\llvm-project\premerge-checks\llvm\test\Transforms\SimpleLoopUnswitch\partial-unswitch.ll | c:\ws\w16-1\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w16-1\llvm-project\premerge-checks\llvm\test\Transforms\SimpleLoopUnswitch\partial-unswitch.ll

### Event Timeline

junparser created this revision.Mar 25 2021, 2:13 AM
junparser requested review of this revision.Mar 25 2021, 2:13 AM
Herald added a project: Restricted Project. Mar 25 2021, 2:13 AM
Herald added a subscriber: llvm-commits.
junparser edited the summary of this revision. (Show Details)Mar 25 2021, 2:14 AM
junparser edited the summary of this revision. (Show Details)

I'm not saying all the pieces will come for free but this feels like an intrinsic optimisation problem rather than an instruction selection one. What about extending SVEIntrinsicOpts.cpp to convert the pattern to a stock splat_vector(extract_vector_elt(vec, idx)) and then letting the code generator decide how best to lower the LLVM way of doing things. This'll mean we solve the problem once for ACLE and auto-vectorisation.

I'm not saying all the pieces will come for free but this feels like an intrinsic optimisation problem rather than an instruction selection one. What about extending SVEIntrinsicOpts.cpp to convert the pattern to a stock splat_vector(extract_vector_elt(vec, idx)) and then letting the code generator decide how best to lower the LLVM way of doing things. This'll mean we solve the problem once for ACLE and auto-vectorisation.

Actually, it is an isel issue, The svdup_lane in title is just where I find this issue.
1), there is no intrinsic direct map to dup (index) instruction, while vector_extract may lower with dup (index), it is not enough. 2) svdup_lane acle intrinsic generates as sve.dup.x + sve.tbl in llvm ir, and covert to AArch64tbl ( ... splat_vector(..., constant)) , then lower to AArch64tbl ( ... DUP(..., imm)). This is the pattern this patch try to match.

I'm not saying all the pieces will come for free but this feels like an intrinsic optimisation problem rather than an instruction selection one. What about extending SVEIntrinsicOpts.cpp to convert the pattern to a stock splat_vector(extract_vector_elt(vec, idx)) and then letting the code generator decide how best to lower the LLVM way of doing things. This'll mean we solve the problem once for ACLE and auto-vectorisation.

Actually, it is an isel issue, The svdup_lane in title is just where I find this issue.
1), there is no intrinsic direct map to dup (index) instruction, while vector_extract may lower with dup (index), it is not enough. 2) svdup_lane acle intrinsic generates as sve.dup.x + sve.tbl in llvm ir, and covert to AArch64tbl ( ... splat_vector(..., constant)) , then lower to AArch64tbl ( ... DUP(..., imm)). This is the pattern this patch try to match.

Sure, I understand that. But the problem of good code generation to duplicate a vector lane seems like a generic one and thus we can solve that first. Then we can canonicalise ACLE related intrinsic patterns to stock LLVM IR and thus not require multiple solutions to the same problem. In the future this will also have the benefit of allowing other stock LLVM transforms to kick in that would otherwise not understand the SVE specific intrinsics.

I'm not saying all the pieces will come for free but this feels like an intrinsic optimisation problem rather than an instruction selection one. What about extending SVEIntrinsicOpts.cpp to convert the pattern to a stock splat_vector(extract_vector_elt(vec, idx)) and then letting the code generator decide how best to lower the LLVM way of doing things. This'll mean we solve the problem once for ACLE and auto-vectorisation.

Actually, it is an isel issue, The svdup_lane in title is just where I find this issue.
1), there is no intrinsic direct map to dup (index) instruction, while vector_extract may lower with dup (index), it is not enough. 2) svdup_lane acle intrinsic generates as sve.dup.x + sve.tbl in llvm ir, and covert to AArch64tbl ( ... splat_vector(..., constant)) , then lower to AArch64tbl ( ... DUP(..., imm)). This is the pattern this patch try to match.

Sure, I understand that. But the problem of good code generation to duplicate a vector lane seems like a generic one and thus we can solve that first. Then we can canonicalise ACLE related intrinsic patterns to stock LLVM IR and thus not require multiple solutions to the same problem. In the future this will also have the benefit of allowing other stock LLVM transforms to kick in that would otherwise not understand the SVE specific intrinsics.

OK, I understand you point. splat_vector(extract_vector_elt(vec, idx)) looks ok for me, and why you prefer do it in in SVEIntrinsicOpts.cpp ? what about do this in performdagcombine with AArch64TBL node?

I'm not saying all the pieces will come for free but this feels like an intrinsic optimisation problem rather than an instruction selection one. What about extending SVEIntrinsicOpts.cpp to convert the pattern to a stock splat_vector(extract_vector_elt(vec, idx)) and then letting the code generator decide how best to lower the LLVM way of doing things. This'll mean we solve the problem once for ACLE and auto-vectorisation.

Actually, it is an isel issue, The svdup_lane in title is just where I find this issue.
1), there is no intrinsic direct map to dup (index) instruction, while vector_extract may lower with dup (index), it is not enough. 2) svdup_lane acle intrinsic generates as sve.dup.x + sve.tbl in llvm ir, and covert to AArch64tbl ( ... splat_vector(..., constant)) , then lower to AArch64tbl ( ... DUP(..., imm)). This is the pattern this patch try to match.

Sure, I understand that. But the problem of good code generation to duplicate a vector lane seems like a generic one and thus we can solve that first. Then we can canonicalise ACLE related intrinsic patterns to stock LLVM IR and thus not require multiple solutions to the same problem. In the future this will also have the benefit of allowing other stock LLVM transforms to kick in that would otherwise not understand the SVE specific intrinsics.

OK, I understand you point. splat_vector(extract_vector_elt(vec, idx)) looks ok for me, and why you prefer do it in in SVEIntrinsicOpts.cpp ? what about do this in performdagcombine with AArch64TBL node?

The reason i prefer to handle in performdagcombine is that what we want to match is AArch64tbl ( ... splat_vector(..., constant)) rather than sve.tbl + sve.dupx. Since shufflevector can also convert to splat_vector.

OK, I understand you point. splat_vector(extract_vector_elt(vec, idx)) looks ok for me, and why you prefer do it in in SVEIntrinsicOpts.cpp ? what about do this in performdagcombine with AArch64TBL node?

The reason i prefer to handle in performdagcombine is that what we want to match is AArch64tbl ( ... splat_vector(..., constant)) rather than sve.tbl + sve.dupx. Since shufflevector can also convert to splat_vector.

I feel the higher up the chain/earlier we do this the better. Outside of the ACLE intrinsics I wouldn't expect scalable AArch64ISD::TBL to be created unless that's exactly what the code generator wants. It's worth highlighting that this sort of SVE ACLE intrinsics -> LLVM IR transform will not be an isolated case. We deliberately created intrinsics even for common transforms so that we could minimise use of stock LLVM IR and thus limit failures due to missing scalable vector support. As LLVM matures I would expect us to utilise stock LLVM IR more and more. For example converting dup's to shufflevector, ptrue all predicated operations to normal LLVM bin ops...etc.

That said, if you think PerformDAGCombine (presumable performIntrinsicCombine) is the best place today then fine. It can easily be moved up the chain when we're more comfortable.

OK, I understand you point. splat_vector(extract_vector_elt(vec, idx)) looks ok for me, and why you prefer do it in in SVEIntrinsicOpts.cpp ? what about do this in performdagcombine with AArch64TBL node?

The reason i prefer to handle in performdagcombine is that what we want to match is AArch64tbl ( ... splat_vector(..., constant)) rather than sve.tbl + sve.dupx. Since shufflevector can also convert to splat_vector.

I feel the higher up the chain/earlier we do this the better. Outside of the ACLE intrinsics I wouldn't expect scalable AArch64ISD::TBL to be created unless that's exactly what the code generator wants. It's worth highlighting that this sort of SVE ACLE intrinsics -> LLVM IR transform will not be an isolated case. We deliberately created intrinsics even for common transforms so that we could minimise use of stock LLVM IR and thus limit failures due to missing scalable vector support. As LLVM matures I would expect us to utilise stock LLVM IR more and more. For example converting dup's to shufflevector, ptrue all predicated operations to normal LLVM bin ops...etc.

I agree, we also wish to use llvm scalar vector ir as well. I'll extend tbl pattern in SVEIntrinsicOpts.cpp as standalone patch, and then update this one.

OK, I understand you point. splat_vector(extract_vector_elt(vec, idx)) looks ok for me, and why you prefer do it in in SVEIntrinsicOpts.cpp ? what about do this in performdagcombine with AArch64TBL node?

The reason i prefer to handle in performdagcombine is that what we want to match is AArch64tbl ( ... splat_vector(..., constant)) rather than sve.tbl + sve.dupx. Since shufflevector can also convert to splat_vector.

I feel the higher up the chain/earlier we do this the better. Outside of the ACLE intrinsics I wouldn't expect scalable AArch64ISD::TBL to be created unless that's exactly what the code generator wants. It's worth highlighting that this sort of SVE ACLE intrinsics -> LLVM IR transform will not be an isolated case. We deliberately created intrinsics even for common transforms so that we could minimise use of stock LLVM IR and thus limit failures due to missing scalable vector support. As LLVM matures I would expect us to utilise stock LLVM IR more and more. For example converting dup's to shufflevector, ptrue all predicated operations to normal LLVM bin ops...etc.

That said, if you think PerformDAGCombine (presumable performIntrinsicCombine) is the best place today then fine. It can easily be moved up the chain when we're more comfortable.

Hi Paul,
Excuse me, I am new to LLVM/backend, one question is: What does "stock LLVM IR" mean(refer to) in above comment?

As for the patch, I am trying to understand the issue, do you suggest we should first introduce DUP_LANE pattern similar to SVDOT_LANE_S so that clang CodeGen doesn't generate dup.x when possible?

Thanks

Hi @paulwalker-arm I'll update this patch based on D99412 later.

Excuse me, I am new to LLVM/backend, one question is: What does "stock LLVM IR" mean(refer to) in above comment?

By stock LLVM IR I'm referring to the LLVM instructions as defined by the LandRef plus non-target specific intrinsics.

As for the patch, I am trying to understand the issue, do you suggest we should first introduce DUP_LANE pattern similar to SVDOT_LANE_S so that clang CodeGen doesn't generate dup.x when possible?

I'm not sure I fully understand your question but in general when it comes to code generation I'm trying to ensure where possible that we have a canonicalised representation so that we minimise the number of patterns (IR or DAG) that end up resolving to the same instruction.

Excuse me, I am new to LLVM/backend, one question is: What does "stock LLVM IR" mean(refer to) in above comment?

By stock LLVM IR I'm referring to the LLVM instructions as defined by the LandRef plus non-target specific intrinsics.

As for the patch, I am trying to understand the issue, do you suggest we should first introduce DUP_LANE pattern similar to SVDOT_LANE_S so that clang CodeGen doesn't generate dup.x when possible?

I'm not sure I fully understand your question but in general when it comes to code generation I'm trying to ensure where possible that we have a canonicalised representation so that we minimise the number of patterns (IR or DAG) that end up resolving to the same instruction.

I was thinking to introduce below like sve intrinsic pattern so that clang CodeGen generates dup_index directly for const imm_index.
def SVDUP_LANE_IMM : SInst<"svdup_lane[_{d}]", "ddi", "csilUcUsUiUlhfd", MergeNone, "aarch64_sve_dup_index", [], [ImmCheck<1, ImmCheck0_31, 1>]>;
However, this seems impossible because this BuiltIn is the same as SVDUP_LANE and results in compilation error.

junparser retitled this revision from [AArch64][SVE] Simplify codegen of svdup_lane intrinsic to [AArch64][SVE] Codegen dup_lane for dup(vector_extract).Mar 28 2021, 11:21 PM
sdesmalen added inline comments.Mar 29 2021, 4:06 AM
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
624 ↗(On Diff #333771)

This isn't entirely correct, because a nxv4f16 has gaps between the elements. A full nxv8f16 has vscale x 8 elements, so that means a nxv4f16 has vscale x 4 elements, with 4 gaps in between, e.g. <elt0, _, elt1, _, .. >. That means the element must be multiplied by 2 in this case (and the one for nxv2f32), and 4 for the nxv2f16 case.

paulwalker-arm added inline comments.Mar 29 2021, 4:43 AM
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
624 ↗(On Diff #333771)

While logically true I think in practice you'd rewrite the patten so the instruction's element type matched that of the "packed" vector associated with the dag result's element count (i.e. D for nxv2, S for nxv4).

So in this instance something like:

def : Pat<(nxv4f16 (AArch64dup (f16 (vector_extract (nxv4f16 ZPR:$vec), sve_elm_idx_extdup_s:$index)))),
(DUP_ZZI_S ZPR:$vec, sve_elm_idx_extdup_s:$index)>;

So in essense all nxv4 results are considered to be duplicating floats, with all nxv2 results the result of duplicating doubles.

Is it possible to move the patterns into the multiclass for sve_int_perm_dup_i?

junparser added inline comments.Mar 29 2021, 4:54 AM
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
624 ↗(On Diff #333771)

This isn't entirely correct, because a nxv4f16 has gaps between the elements. A full nxv8f16 has vscale x 8 elements, so that means a nxv4f16 has vscale x 4 elements, with 4 gaps in between, e.g. <elt0, _, elt1, _, .. >. That means the element must be multiplied by 2 in this case (and the one for nxv2f32), and 4 for the nxv2f16 case.

This is quiet different than what I thought, for nxv4f16, I thought the upper 64bit should be empty. Where can i find these rules? I haven't see such in anywhere

624 ↗(On Diff #333771)

OK, I'll move them to sve_int_perm_dup_i

junparser updated this revision to Diff 333826.Mar 29 2021, 5:21 AM

sdesmalen added inline comments.Mar 29 2021, 5:27 AM
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
624 ↗(On Diff #333771)

We haven't explicitly described these rules anywhere I believe. This format is required to generate code for scalable vectors because we have no means to generate a predicate for nxv4f16 that's like <11110000 | ... | 11110000 >, where the bitpattern repeats for each 128-bit chunk . We can however always use the unpacked format, because an operation on nxv4f16 can use the predicate that would be used for a nxv4f32, and thus disables every other lane.

junparser added inline comments.Mar 29 2021, 5:33 AM
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
624 ↗(On Diff #333771)

We haven't explicitly described these rules anywhere I believe. This format is required to generate code for scalable vectors because we have no means to generate a predicate for nxv4f16 that's like <11110000 | ... | 11110000 >, where the bitpattern repeats for each 128-bit chunk . We can however always use the unpacked format, because an operation on nxv4f16 can use the predicate that would be used for a nxv4f32, and thus disables every other lane.

624 ↗(On Diff #333771)

Thanks for explain this!

paulwalker-arm accepted this revision.Mar 29 2021, 7:45 AM
This revision is now accepted and ready to land.Mar 29 2021, 7:45 AM

Just wanted to add that the patch summary no longer matches the intent of the patch.

junparser edited the summary of this revision. (Show Details)Mar 29 2021, 6:51 PM
This revision was landed with ongoing or failed builds.Mar 29 2021, 7:35 PM
This revision was automatically updated to reflect the committed changes.