Differential D99324
[AArch64][SVE] Codegen dup_lane for dup(vector_extract)
Authored by junparser on Mar 25 2021, 2:13 AM.
Details

As the title says, this patch adds the pattern dup(vector_extract(vec, idx)), which matches the dup_lane instruction: DUP <Zd>.<T>, <Zn>.<T>[<imm>]

Test Plan: check-llvm
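To make the pattern concrete, here is a minimal scalable-vector IR sketch (the function name, element type, and lane index are illustrative, not taken from the patch) of the kind of input this maps to. At the DAG level it becomes splat_vector(extract_vector_elt(vec, 1)), and the goal discussed here is for such a broadcast of a known lane to select to a single lane-indexed DUP (e.g. dup z0.s, z0.s[1]) rather than a longer sequence.

  define <vscale x 4 x i32> @dup_lane_example(<vscale x 4 x i32> %vec) {
    ; Extract lane 1 and broadcast it to every element of the result.
    %elt   = extractelement <vscale x 4 x i32> %vec, i64 1
    %ins   = insertelement <vscale x 4 x i32> undef, i32 %elt, i64 0
    %splat = shufflevector <vscale x 4 x i32> %ins, <vscale x 4 x i32> undef,
                           <vscale x 4 x i32> zeroinitializer
    ret <vscale x 4 x i32> %splat
  }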
Event Timeline

Comment: I'm not saying all the pieces will come for free, but this feels like an intrinsic optimisation problem rather than an instruction selection one. What about extending SVEIntrinsicOpts.cpp to convert the pattern to a stock splat_vector(extract_vector_elt(vec, idx)) and then letting the code generator decide how best to lower the LLVM way of doing things? This means we solve the problem once for both ACLE and auto-vectorisation.

Comment: Actually, it is an isel issue; the svdup_lane in the title is just where I found it.

Comment: Sure, I understand that. But good code generation for duplicating a vector lane seems like a generic problem, so we can solve that first. Then we can canonicalise ACLE-related intrinsic patterns to stock LLVM IR and not require multiple solutions to the same problem. In the future this will also have the benefit of allowing other stock LLVM transforms to kick in that would otherwise not understand the SVE-specific intrinsics.

Comment: OK, I understand your point. splat_vector(extract_vector_elt(vec, idx)) looks fine to me, but why do you prefer to do it in SVEIntrinsicOpts.cpp? What about doing it in performDAGCombine on the AArch64ISD::TBL node?

Comment: The reason I prefer to handle it in performDAGCombine is that what we want to match is AArch64tbl(... splat_vector(..., constant)) rather than sve.tbl + sve.dupx, since shufflevector can also be converted to splat_vector.
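For contrast, a hedged sketch of the intrinsic form mentioned above (the function name and element type are illustrative; the intrinsics are the existing llvm.aarch64.sve.tbl and llvm.aarch64.sve.dup.x). This is roughly the sve.tbl + sve.dupx shape, and the proposal is to recognise it, either at the IR level in SVEIntrinsicOpts.cpp or as AArch64tbl(..., splat_vector(constant)) in performDAGCombine, and treat it the same as the stock splat-of-extract form shown earlier.

  declare <vscale x 4 x i32> @llvm.aarch64.sve.dup.x.nxv4i32(i32)
  declare <vscale x 4 x i32> @llvm.aarch64.sve.tbl.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i32>)

  define <vscale x 4 x i32> @dup_lane_via_tbl(<vscale x 4 x i32> %vec) {
    ; Splat the constant lane index, then use it as the TBL index vector.
    %idx = call <vscale x 4 x i32> @llvm.aarch64.sve.dup.x.nxv4i32(i32 1)
    %res = call <vscale x 4 x i32> @llvm.aarch64.sve.tbl.nxv4i32(<vscale x 4 x i32> %vec,
                                                                 <vscale x 4 x i32> %idx)
    ret <vscale x 4 x i32> %res
  }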
Comment: I feel the higher up the chain/earlier we do this the better. Outside of the ACLE intrinsics I wouldn't expect a scalable AArch64ISD::TBL to be created unless that's exactly what the code generator wants. It's worth highlighting that this sort of SVE ACLE intrinsics -> LLVM IR transform will not be an isolated case. We deliberately created intrinsics even for common operations so that we could minimise the use of stock LLVM IR and thus limit failures due to missing scalable-vector support. As LLVM matures I would expect us to use stock LLVM IR more and more, for example converting dups to shufflevector, or ptrue-all predicated operations to normal LLVM bin ops, etc. That said, if you think PerformDAGCombine (presumably performIntrinsicCombine) is the best place today then fine; it can easily be moved up the chain when we're more comfortable.
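As a small, hedged illustration of the ptrue-all point above (the function name is made up; the intrinsics are the existing llvm.aarch64.sve.ptrue and llvm.aarch64.sve.add): when the governing predicate is all-active, the predication carries no extra information, so the call could in principle be rewritten to the stock IR instruction.

  declare <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32)
  declare <vscale x 4 x i32> @llvm.aarch64.sve.add.nxv4i32(<vscale x 4 x i1>, <vscale x 4 x i32>, <vscale x 4 x i32>)

  define <vscale x 4 x i32> @add_all_active(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) {
    ; ptrue with pattern 31 (SV_ALL) makes every lane active, so this is
    ; equivalent to the plain instruction:  %r = add <vscale x 4 x i32> %a, %b
    %pg = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
    %r  = call <vscale x 4 x i32> @llvm.aarch64.sve.add.nxv4i32(<vscale x 4 x i1> %pg,
                                                                <vscale x 4 x i32> %a,
                                                                <vscale x 4 x i32> %b)
    ret <vscale x 4 x i32> %r
  }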
Comment: I agree; we also wish to use LLVM scalable-vector IR as well. I'll extend the tbl handling in SVEIntrinsicOpts.cpp as a standalone patch, and then update this one.

Comment: Hi Paul, regarding the patch, I am trying to understand the issue: are you suggesting we should first introduce a DUP_LANE pattern similar to SVDOT_LANE_S so that clang CodeGen doesn't generate dup.x when possible? Thanks.

Comment: By stock LLVM IR I'm referring to the LLVM instructions as defined by the LangRef plus the non-target-specific intrinsics. I'm not sure I fully understand your question, but in general when it comes to code generation I'm trying to ensure, where possible, that we have a canonicalised representation so that we minimise the number of patterns (IR or DAG) that end up resolving to the same instruction.

Comment: I was thinking of introducing an SVE intrinsic pattern along the lines of SVDOT_LANE_S, so that clang CodeGen generates dup_index directly for a constant imm_index.