Page MenuHomePhabricator

[AArch64] Add IR intrinsics for sq(r)dmulh_lane(q)

Authored by sanwou01 on Dec 13 2019, 8:02 AM.



Currently, sqdmulh_lane and friends from the ACLE (implemented in arm_neon.h),
are represented in LLVM IR as a (by vector) sqdmulh and a vector of (repeated)
indices, like so:

%shuffle = shufflevector <4 x i16> %v, <4 x i16> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>
%vqdmulh2.i = tail call <4 x i16> @llvm.aarch64.neon.sqdmulh.v4i16(<4 x i16> %a, <4 x i16> %shuffle)

When %v's values are known, the shufflevector is optimized away and we are no
longer able to select the lane variant of sqdmulh in the backend.

This defeats a (hand-coded) optimization that packs several constants into a
single vector and uses the lane intrinsics to reduce register pressure and
trade-off materialising several constants for a single vector load from the
constant pool, like so:

int16x8_t v = {2,3,4,5,6,7,8,9};
a = vqdmulh_laneq_s16(a, v, 0);
b = vqdmulh_laneq_s16(b, v, 1);
c = vqdmulh_laneq_s16(c, v, 2);
d = vqdmulh_laneq_s16(d, v, 3);

In one microbenchmark from libjpeg-turbo this accounts for a 2.5% to 4%
performance difference.

We could teach the compiler to recover the lane variants, but this would likely
require its own pass. (Alternatively, "volatile" could be used on the constants
vector, but this is a bit ugly.)

This patch instead implements the following LLVM IR intrinsics for AArch64 to
maintain the original structure through IR optmization and into instruction

  • sqdmulh_lane
  • sqdmulh_laneq
  • sqrdmulh_lane
  • sqrdmulh_laneq.

These 'lane' variants need an additional register class. The second argument
must be in the lower half of the 64-bit NEON register file, but only when
operating on i16 elements.

Note that the existing patterns for shufflevector and sqdmulh into sqdmulh_lane
(etc.) remain, so code that does not rely on NEON intrinsics to generate these
instructions is not affected.

This patch also changes clang to emit these IR intrinsics for the corresponding
NEON intrinsics (AArch64 only).

Diff Detail

Event Timeline

sanwou01 created this revision.Dec 13 2019, 8:02 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptDec 13 2019, 8:02 AM

This makes it impossible to do a neat trick when using NEON intrinsics: one can load a number of constants using a single vector load, which are then repeatedly used to multiply whole vectors by one of the constants. This trick is used for a nice performance upside (2.5% to 4% on one microbenchmark) in libjpeg-turbo.

I'm not completely sure I follow here. The "trick" is something like the following?

int16x8_t v = {2,3,4,5,6,7,8,9};
a = vqdmulh_laneq_s16(a, v, 0);
b = vqdmulh_laneq_s16(b, v, 1);
c = vqdmulh_laneq_s16(c, v, 2);
d = vqdmulh_laneq_s16(d, v, 3);

I can see how that could be helpful. The compiler could probably be taught to recover something like the original structure, but it would probably require a dedicated pass. Or I guess you could hack the source to use "volatile", but that would be ugly.

I'm a little unhappy we're forced to introduce more intrinsics here, but it might be the best solution to avoid breaking carefully tuned code like this.

1374 ↗(On Diff #233807)

Hardcoding "64" and "128" in target-independent code here seems like a bad idea.

Can we just let both vector operands have any vector type, and reject in the backend if we see an unexpected type?

6054 ↗(On Diff #233807)

Is this related somehow?

sanwou01 marked 2 inline comments as done.Tue, Jan 28, 8:15 AM

Thanks Eli.

The "trick" is something like the following?

Yeah, that's exactly right. Your assessment of the options (dedicated pass, "volatile") matches our thinking as well. I'll update the commit message to make this a bit clearer.

1374 ↗(On Diff #233807)

Makes sense. Any type vector for both operands is certainly doable. Instruction selection will fail if you try to use a non-existent intrinsic, which is not the nicest failure mode, but probably good enough for intrinsics? Emitting the correct arm_neon.h for clang is a little less trivial, but not by too much.

6054 ↗(On Diff #233807)

This popped up when I was looking for uses of FPR128_loRegClass; it made sense to do the same for FPR64_lo. Doesn't seem essential though, so I'm happy to leave this out.

sanwou01 updated this revision to Diff 240902.Tue, Jan 28, 9:18 AM
sanwou01 retitled this revision from [AArch64] Add sq(r)dmulh_lane(q) LLVM IR intrinsics to [AArch64] Add IR intrinsics for sq(r)dmulh_lane(q).
sanwou01 edited the summary of this revision. (Show Details)

Address Eli's feedback; clarified commit message.

This revision is now accepted and ready to land.Tue, Jan 28, 12:38 PM
This revision was automatically updated to reflect the committed changes.