Differential D87231
[AArch64] Match pairwise add/fadd pattern
Closed, Public
Authored by sanwou01 on Sep 7 2020, 5:00 AM.
Details

Summary

D75689 turns the faddp pattern into a shuffle with a vector add. Match this new pattern in a target-specific DAG combine rather than in ISel, because legalization (for v2f32) turns it into a bit of a mess.

- Extended to cover f16, f32, f64 and i64.
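For illustration, the f64 case the summary mentions has the same shape as the f32 example in the thread below (a sketch, not code from the review; the expected faddp selection is an assumption, not a verified output):

typedef double double2 __attribute__((ext_vector_type(2)));

double faddp_f64(double2 x) {
  /* Shuffle + vector fadd + extract of lane 0; with this patch the
     AArch64 backend is expected to select the scalar pairwise form
     faddp d0, v0.2d instead of a dup/fadd sequence. */
  return (__builtin_shufflevector(x, x, 1, 1) + x)[0];
}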
Diff Detail
Event Timeline

Herald added subscribers: llvm-commits, danielkiss, hiraditya, kristof.beyls.
sanwou01 added reviewers: samparker, SjoerdMeijer, dmgreen, spatel, lebedev.ri, RKSimon, jgorbe. Sep 7 2020, 5:02 AM

Comment Actions
We had similar codegen matching problems for x86's horizontal ops and fixed them with DAGCombiner/ISel pattern matching.

typedef float float2 __attribute__((ext_vector_type(2)));

float faddp(float2 x) {
  return (__builtin_shufflevector(x, x, 1, 1) + x)[0];
}

$ clang -O1 faddp.c -S -o - -target aarch64 -mllvm -disable-vector-combine

faddp:                                  // @faddp
        dup     v1.2s, v0.s[1]
        fadd    v0.2s, v1.2s, v0.2s
        ret

Comment Actions
Thanks @spatel. You're right that we miss that pattern, but so does x86 currently, it seems (I don't read x86 very well, so I might be wrong). Using your faddp example:

$ ./bin/clang -O1 ~/tmp/faddp.c -S -o - -target x86_64 -mllvm -disable-vector-combine
...
faddp:                                  # @faddp
        .cfi_startproc
# %bb.0:                                # %entry
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        movaps  %xmm0, %xmm1
        shufps  $85, %xmm0, %xmm1       # xmm1 = xmm1[1,1],xmm0[1,1]
        addss   %xmm0, %xmm1
        movaps  %xmm1, %xmm0
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq

I did find scalarizeBinOpOfSplats in DAGCombiner, but that doesn't seem to work here, nor do any of the other patterns in SimplifyVBinOp. That said, it does seem to make sense to do this in DAGCombiner; thanks for the suggestion. I'll try that.
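The integer case the summary covers has the same shape (again a sketch, not from the review thread; the addp expectation is an assumption, not a verified output):

typedef long long i64x2 __attribute__((ext_vector_type(2)));

long long addp_i64(i64x2 x) {
  /* Integer pairwise add; the extended combine is expected to select
     addp d0, v0.2d here rather than a dup/add sequence. */
  return (__builtin_shufflevector(x, x, 1, 1) + x)[0];
}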
Comment Actions
Horizontal math ops are a special case for x86 (not all targets support them, and even fewer prefer them for performance), so we need to make a CPU subtarget adjustment to see if that example is working:

$ clang -O1 faddp.c -S -o - -target x86_64 -mllvm -disable-vector-combine -march=btver2

vhaddps %xmm0, %xmm0, %xmm0
The x86 horizontal transforms are specialized because the HW instructions themselves are weird - no sane target would ever create that functionality from scratch. :) That said, there may still be room to improve the cost models and/or usage here, but I'm not sure exactly how to adjust it. For example, we might match this pattern as a 2-way pairwise reduction?

Comment Actions
I think the best option would be to start generating reduction intrinsics in IR, ensure cost models are accurate for them, and do all we can to coax the vectorizers to recognize them (inc. partial reduction patterns) - I've been playing whack-a-mole with improving HorizOp patterns in the backend for years now and it's not fun anymore :-( The plan to drop the experimental tags from the reduction intrinsics keeps getting delayed - I think due to a couple of minor issues: efficient non-pow-2 type handling and inf/nan handling for fp types are the ones that @spatel reminded me about recently.
Comment Actions
Thanks for the feedback. I agree that ideally we'd be generating reduction intrinsics in IR and matching those in the backends. I don't think the pairwise add can be represented with the current intrinsics, though: we'd need a <2 x float> variant, or a predicated version of the <4 x float> intrinsic, to do this for strict FP math, I believe. So at least for the moment I'll continue playing whack-a-mole and match the pattern in AArch64 ISel lowering.

sanwou01 retitled this revision from [AArch64] ExtractElement is free when combined with pairwise add to [AArch64] Match pairwise fadd pattern. Sep 16 2020, 2:41 AM
sanwou01 retitled this revision from [AArch64] Match pairwise fadd pattern to [AArch64] Match pairwise add/fadd pattern. Sep 16 2020, 5:52 AM

Comment Actions
Thanks for making the extra fp16 patterns too. LGTM
This revision is now accepted and ready to land. Sep 16 2020, 9:56 AM
Revision Contents
Diff 292484

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
llvm/lib/Target/AArch64/AArch64InstrInfo.td
llvm/test/CodeGen/AArch64/faddp-half.ll
llvm/test/CodeGen/AArch64/faddp.ll
llvm/test/CodeGen/AArch64/vecreduce-fadd.ll
Could this apply equally to f16/f64 as well?
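Per the summary above, yes: the combine was extended to cover f16, f32, f64 and i64 (see faddp-half.ll in the diff). A minimal sketch of the f16 case, by analogy with the earlier examples - the fp16 target flags and the faddp h-form selection are assumptions, not outputs from the review:

typedef _Float16 half2 __attribute__((ext_vector_type(2)));

_Float16 faddp_f16(half2 x) {
  /* Requires an fp16-capable target, e.g.
     -target aarch64 -march=armv8.2-a+fp16 (assumed flags);
     expected to select faddp h0, v0.2h with the extended combine. */
  return (__builtin_shufflevector(x, x, 1, 1) + x)[0];
}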