It looks like gcc folds the vector shift intrinsic with a zero shift amount, as in the example below.
```c
#include <arm_neon.h>

inline void foo(int64x2_t a, int64x2_t b, int64_t *dst, int df) {
  int64x2_t df_s64 = vdupq_n_s64(df);
  a = vpaddq_s64(a, b);
  a = vshlq_s64(a, df_s64);
  vst1q_s64(dst, a);
}

void bar(int64x2_t a, int64x2_t b, int64_t *dst) {
  foo(a, b, dst, 0);
}
```

gcc output:
```
bar:
        addp    v0.2d, v0.2d, v1.2d
        str     q0, [x0]
        ret
```

llvm output:
```
bar:
        addp    v0.2d, v0.2d, v1.2d
        shl     v0.2d, v0.2d, #0
        str     q0, [x0]
        ret
```
It looks like llvm's AArch64 target lowers the intrinsic to a target-specific custom node in SelectionDAG and is missing a fold of that custom node when the shift amount is zero.
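For illustration, here is a minimal sketch of what such a fold could look like as a target DAG combine. It assumes the splatted shift amount has already been lowered to an immediate operand of the target's vector shift-left node; the helper name and the exact node checked are illustrative, not the patch's actual code.

```c++
// Hypothetical helper, e.g. somewhere in AArch64ISelLowering.cpp.
// Folds a target vector shift-left node whose immediate shift amount is 0.
static SDValue foldZeroVectorShift(SDNode *N) {
  // Operand 1 is assumed to hold the immediate shift amount.
  if (auto *Amt = dyn_cast<ConstantSDNode>(N->getOperand(1)))
    if (Amt->getZExtValue() == 0)
      return N->getOperand(0); // shl #0 is the identity; forward the input.
  return SDValue(); // No fold; keep the node as-is.
}
```

Something equivalent registered in the target's DAG combiner would drop the `shl v0.2d, v0.2d, #0` above before instruction selection.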
With this patch, the llvm output is as below:
```
bar:
        addp    v0.2d, v0.2d, v1.2d
        str     q0, [x0]
        ret
```
Apparently this one is not correct, as sqshlu still saturates the input even with a zero shift. The others look OK.
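For reference, a small standalone example (using the ACLE intrinsic `vqshluq_n_s64`, which maps to sqshlu) showing that a zero shift is not an identity for this instruction, since negative lanes saturate to 0 in the unsigned result:

```c++
#include <arm_neon.h>
#include <inttypes.h>
#include <stdio.h>

int main(void) {
  int64x2_t x = vdupq_n_s64(-1);
  // sqshlu with a zero shift still saturates the signed input to the
  // unsigned range, so the negative lanes become 0 rather than staying -1.
  uint64x2_t y = vqshluq_n_s64(x, 0);
  printf("%" PRIu64 " %" PRIu64 "\n",
         vgetq_lane_u64(y, 0), vgetq_lane_u64(y, 1));
  return 0;
}
```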