Page MenuHomePhabricator

[AArch64] Match pairwise add/fadd pattern
ClosedPublic

Authored by sanwou01 on Sep 7 2020, 5:00 AM.

Details

Summary

D75689 turns the faddp pattern into a shuffle with vector add.

Match this new pattern in target-specific DAG combine, rather than ISel,
because legalization (for v2f32) turns it into a bit of a mess.

  • extended to cover f16, f32, f64 and i64

Diff Detail

Event Timeline

sanwou01 created this revision.Sep 7 2020, 5:00 AM
sanwou01 requested review of this revision.Sep 7 2020, 5:00 AM

We had similar codegen matching problems for x86's horizontal ops and fixed them with DAGCombiner/ISel pattern matching.
If you don't do that, then I think you're still going to miss faddp opportunities if the source/IR is already in the form with a shuffle.
Example:

typedef float float2 __attribute__((ext_vector_type(2)));

float faddp(float2 x) {
  return (__builtin_shufflevector(x, x, 1, 1) + x)[0];
}
$ clang -O1 faddp.c -S -o - -target aarch64 -mllvm -disable-vector-combine 
faddp:                                  // @faddp
	dup	v1.2s, v0.s[1]
	fadd	v0.2s, v1.2s, v0.2s
	ret

Thanks @spatel . You're right that we miss that pattern, but, so does x86 currently it seems (I don't read x86 very well so I might be wrong). Using your faddp example:

$  ./bin/clang -O1 ~/tmp/faddp.c -S -o - -target x86_64 -mllvm -disable-vector-combine
...
faddp:                                  # @faddp
        .cfi_startproc
# %bb.0:                                # %entry
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        movaps  %xmm0, %xmm1
        shufps  $85, %xmm0, %xmm1               # xmm1 = xmm1[1,1],xmm0[1,1]
        addss   %xmm0, %xmm1
        movaps  %xmm1, %xmm0
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq

I did find scalarizeBinOpOfSplats in DAGCombiner but that doesn't seem to work here, nor do any of the other patterns in SimplifyVBinOp.

That said, it does seem to make sense to do this in DAGCombiner, thanks for the suggestion. I'll try that.

spatel added a comment.Sep 8 2020, 5:21 AM

Thanks @spatel . You're right that we miss that pattern, but, so does x86 currently it seems (I don't read x86 very well so I might be wrong).

Horizontal math ops are a special case for x86 (not all targets support them and even fewer prefer them for performance), so we need to make a CPU subtarget adjustment to see if that example is working:

$ clang -O1 faddp.c -S -o - -target x86_64 -mllvm -disable-vector-combine -march=btver2
  vhaddps	%xmm0, %xmm0, %xmm0

I did find scalarizeBinOpOfSplats in DAGCombiner but that doesn't seem to work here, nor do any of the other patterns in SimplifyVBinOp.

The x86 horizontal transforms are specialized because the HW instructions themselves are weird - no sane target would ever create that functionality from scratch. :)
See "LowerToHorizontalOp" and "lowerAddSubToHorizontalOp" in X86ISelLowering.cpp.

That said, there may still be room to improve the cost models and/or usage here, but I'm not sure exactly how to adjust it. For example, we might match this pattern as a 2-way pairwise reduction?

I think the best option would be to start generating reduction intrinsics in IR, ensure cost models are accurate for them and do all we can to coax the vectorizers to recognize them (inc. partial reduction patterns) - I've been playing whack-a-mole with improving HorizOp patterns in the backend for years now and its not fun any more :-(

The plan to drop the experimental tags from the reduction intrinsics keeps getting delayed - I think due to a couple of minor issues - efficient non pow-2 type handling and inf/nan handling for fp types are the ones that @spatel reminded me about recently.

RKSimon added inline comments.Sep 9 2020, 3:35 AM
llvm/test/CodeGen/AArch64/combine-vectors-faddp.ll
2 ↗(On Diff #290249)

VectorCombine tests should be put in llvm\test\Transforms\VectorCombine

Thanks for the feedback. I agree that ideally we'd be generating reduction intrinsics in IR and matching that in the backends. I don't think the pairwise add can be represented with the current intrinsics though: we'd need a <2 x float> variant, or a predicated version of the <4 x float> intrinsic to do this for strict FP math, I believe.

So at least for the moment I'll continue playing whack-a-mole and match the pattern in AArch64 ISel lowering.

sanwou01 updated this revision to Diff 292158.Sep 16 2020, 2:41 AM

Rework to match faddp in AArch64 ISel lowering

sanwou01 retitled this revision from [AArch64] ExtractElement is free when combined with pairwise add to [AArch64] Match pairwise fadd pattern.Sep 16 2020, 2:41 AM
sanwou01 edited the summary of this revision. (Show Details)
dmgreen added inline comments.Sep 16 2020, 3:06 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11598

Could this apply equally for f16/f64 as well?

sanwou01 added inline comments.Sep 16 2020, 3:29 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11598

I think so. Looks like we're missing the f16 FADDP pattern in ISel so might as well add that too.

Similar for i64 ADDP actually.

sanwou01 updated this revision to Diff 292195.Sep 16 2020, 5:50 AM
sanwou01 edited the summary of this revision. (Show Details)

Extend to f16, f32, f64 and i64

sanwou01 retitled this revision from [AArch64] Match pairwise fadd pattern to [AArch64] Match pairwise add/fadd pattern.Sep 16 2020, 5:52 AM
sanwou01 edited the summary of this revision. (Show Details)
dmgreen accepted this revision.Sep 16 2020, 9:56 AM

Thanks for making the extra fp16 patterns too. LGTM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11627

SDLoc(N) -> DL

This revision is now accepted and ready to land.Sep 16 2020, 9:56 AM
sanwou01 updated this revision to Diff 292484.Sep 17 2020, 6:56 AM

Fix for when there is no fp16 faddp + testing

sanwou01 marked an inline comment as done.Sep 17 2020, 7:00 AM
sanwou01 added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11627

Huh, I missed those, thanks! This'll be fixed when I land this change.

sanwou01 marked 2 inline comments as done.Sep 17 2020, 7:06 AM