This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Add FP16 vector insert/extract patterns
Closed · Public

Authored by miyuki on May 30 2019, 4:42 AM.


Details

Reviewers

eli.friedman
efriedma

Commits

rG08da01b49648: [ARM] Add FP16 vector insert/extract patterns
rL362482: [ARM] Add FP16 vector insert/extract patterns

Summary

This change adds two FP16 extraction and two insertion patterns
(one per possible vector length).
Extractions are handled by copying a Q/D register into one of the VFP2
class registers, where single FP32 sub-registers can be accessed. The
extraction of an even lane is then a simple sub-register extraction
(because we don't care about the top parts of registers for FP16
operations). Odd lanes need an additional VMOVX instruction.
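The lane arithmetic described above can be sketched in plain C++. This is only an illustration, not code from the patch: `F16LaneAccess`, `describeF16Extract` and `runExtractExamples` are hypothetical names. The division by two mirrors the `SSubReg_f16_reg` transform in the diff, and the parity test mirrors `imm_even`/`imm_odd`.

```cpp
#include <cassert>

// Hypothetical helper type (not part of the patch): how one f16 lane of a
// D or Q register is reached through its 32-bit S sub-registers.
struct F16LaneAccess {
  unsigned SSubRegIndex; // offset from ssub_0 of the S register holding the lane
  bool NeedsVMOVX;       // true when the lane sits in the top half of that S register
};

// Two f16 lanes share one S register, so the S index is Lane / 2.
// Even lanes occupy the bottom half and need no extra instruction;
// odd lanes occupy the top half and need a VMOVX to bring them down.
constexpr F16LaneAccess describeF16Extract(unsigned Lane) {
  return {Lane / 2, (Lane & 1) != 0};
}

// Spot checks that mirror the lanes exercised by fp16-insert-extract.ll.
inline void runExtractExamples() {
  assert(describeF16Extract(1).SSubRegIndex == 0 && describeF16Extract(1).NeedsVMOVX);
  assert(describeF16Extract(2).SSubRegIndex == 1 && !describeF16Extract(2).NeedsVMOVX);
  assert(describeF16Extract(6).SSubRegIndex == 3 && !describeF16Extract(6).NeedsVMOVX);
  assert(describeF16Extract(7).SSubRegIndex == 3 && describeF16Extract(7).NeedsVMOVX);
}
```

For lane 7 of a `<8 x half>` value this gives S index 3 with a VMOVX, which agrees with the `vmovx.f16 s0, s3` check in the test file.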

Unfortunately, insertions cannot be handled in the same way, because:

There is no instruction to insert FP16 into an even lane (VINS only works with odd lanes)
The patterns for odd lanes would have the form of a DAG (not a tree), and so could not be implemented in pure TableGen

Because of this, insertions are handled in the same way as 16-bit
integer insertions (with conversions between FP registers and GPRs
using VMOVRH instructions).
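For the 128-bit case, the insertion pattern in the diff first picks the D half that contains the lane, sets the lane inside that half, and then puts the half back. The index split can be sketched as below. `QLaneSplit`, `splitQLane16` and `runSplitExamples` are hypothetical names. The `& 3` matches `SubReg_i16_lane` as shown in the diff, while the `/ 4` is an assumption about `DSubReg_i16_reg`, whose body does not appear on this page.

```cpp
#include <cassert>

// Hypothetical helper type (not part of the patch): where a v8f16 lane lives
// once the Q register is viewed as two D registers of four 16-bit lanes each.
struct QLaneSplit {
  unsigned DSubRegIndex; // 0 = low D half, 1 = high D half (assumed to be Lane / 4)
  unsigned LaneInD;      // lane within that D register (Lane & 3, as in SubReg_i16_lane)
};

constexpr QLaneSplit splitQLane16(unsigned Lane) {
  return {Lane / 4, Lane & 3};
}

// Spot checks that mirror the lanes used by the test_vset_laneq_f16_* tests.
inline void runSplitExamples() {
  assert(splitQLane16(1).DSubRegIndex == 0 && splitQLane16(1).LaneInD == 1);
  assert(splitQLane16(7).DSubRegIndex == 1 && splitQLane16(7).LaneInD == 3);
}
```

Under that assumption, lane 7 of the Q register becomes lane 3 of the high D register, which agrees with the `vmov.16 d{{[0-9]+}}[3]` check in `test_vset_laneq_f16_7`.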

Without these patterns the ARM backend would sometimes fail during
instruction selection.

This patch also adds patterns which combine:

an FP16 element extraction and a store into a single VST1 instruction
an FP16 load and insertion into a single VLD1 instruction

Diff Detail

Repository: rL LLVM

Event Timeline

miyuki created this revision. May 30 2019, 4:42 AM

Herald added a project: Restricted Project. May 30 2019, 4:42 AM

Herald added subscribers: hiraditya, kristof.beyls, javed.absar.

dnsampaio added a subscriber: dnsampaio. May 30 2019, 4:48 AM

Could these be done without having to move to the GPR register file and back? The v8.2A FP16 extension added the VINS and VMOVX instructions, which move data between the top and bottom halves of S registers and look ideal for this.
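As a rough bit-level model of the two instructions mentioned here, the sketch below treats an S register as a 32-bit integer. The function names `vmovx` and `vins` are illustrative only, and the semantics follow my reading of the Arm architecture description: VMOVX copies the top half of the source into the bottom half of the destination and zeroes the rest, while VINS copies the bottom half of the source into the top half of the destination and preserves the bottom half.

```cpp
#include <cassert>
#include <cstdint>

// Model of VMOVX.F16 Sd, Sm: Sd[15:0] = Sm[31:16], Sd[31:16] = 0.
constexpr std::uint32_t vmovx(std::uint32_t Sm) {
  return Sm >> 16;
}

// Model of VINS.F16 Sd, Sm: Sd[31:16] = Sm[15:0], Sd[15:0] is preserved.
constexpr std::uint32_t vins(std::uint32_t Sd, std::uint32_t Sm) {
  return (Sd & 0x0000FFFFu) | (Sm << 16);
}

// Spot checks on arbitrary bit patterns.
inline void runHalfMoveExamples() {
  assert(vmovx(0xABCD1234u) == 0x0000ABCDu);
  assert(vins(0x11112222u, 0x3333AAAAu) == 0xAAAA2222u);
}
```

In this model VINS can only write the top half of an S register, which corresponds to an odd f16 lane. That is the restriction the summary cites for insertions.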

Extracting an element clearly shouldn't go through GPRs, yes; it can always be done as a no-op copy or a VMOVX.

For inserting an element, I'm not sure there are better sequences in all cases.

Changed extraction patterns to avoid using GPRs as intermediate registers.

LGTM

We could possibly use a custom inserter to generate the vins sequence, but it would probably involve some benchmarking to make sure there aren't any unexpected performance penalties due to the weird register usage. So I'm happy to put that off for now.

(On a side-note, I think you can insert a float into element zero of a vector with two vext instructions, which is the same number of instructions, but maybe lower latency.)

This revision is now accepted and ready to land. May 31 2019, 1:58 PM

Closed by commit rL362482: [ARM] Add FP16 vector insert/extract patterns (authored by miyuki). Jun 4 2019, 2:37 AM

This revision was automatically updated to reflect the committed changes.

In D62651#1525589, @efriedma wrote:

LGTM

We could possibly use a custom inserter to generate the vins sequence, but it would probably involve some benchmarking to make sure there aren't any unexpected performance penalties due to the weird register usage. So I'm happy to put that off for now.

(On a side-note, I think you can insert a float into element zero of a vector with two vext instructions, which is the same number of instructions, but maybe lower latency.)

Thanks for the suggestion. I've raised a ticket in our internal issue tracker so that we can return to it later.

Revision Contents

Path	Size
llvm/trunk/lib/Target/ARM/ARMInstrNEON.td	53 lines
llvm/trunk/test/CodeGen/ARM/fp16-insert-extract.ll	72 lines
llvm/trunk/test/CodeGen/ARM/fp16-vldlane-vstlane.ll	56 lines

Diff 202885

llvm/trunk/lib/Target/ARM/ARMInstrNEON.td


[... 1,111 lines not shown ...]
 def VLD1LNd32 : VLD1LN32<0b1000, {?,0,?,?}, "32", v2i32, load> {
   let Inst{7} = lane{0};
   let Inst{5-4} = Rn{5-4};
 }

 def VLD1LNq8Pseudo : VLD1QLNPseudo<v16i8, extloadi8>;
 def VLD1LNq16Pseudo : VLD1QLNPseudo<v8i16, extloadi16>;
 def VLD1LNq32Pseudo : VLD1QLNPseudo<v4i32, load>;

+def : Pat<(vector_insert (v4f16 DPR:$src),
+                         (f16 (load addrmode6:$addr)), imm:$lane),
+          (VLD1LNd16 addrmode6:$addr, DPR:$src, imm:$lane)>;
+def : Pat<(vector_insert (v8f16 QPR:$src),
+                         (f16 (load addrmode6:$addr)), imm:$lane),
+          (VLD1LNq16Pseudo addrmode6:$addr, QPR:$src, imm:$lane)>;
 def : Pat<(vector_insert (v2f32 DPR:$src),
                          (f32 (load addrmode6:$addr)), imm:$lane),
           (VLD1LNd32 addrmode6:$addr, DPR:$src, imm:$lane)>;
 def : Pat<(vector_insert (v4f32 QPR:$src),
                          (f32 (load addrmode6:$addr)), imm:$lane),
           (VLD1LNq32Pseudo addrmode6:$addr, QPR:$src, imm:$lane)>;

 // A 64-bit subvector insert to the first 128-bit vector position
[... 1,042 lines not shown ...]
 def VST1LNq16Pseudo : VST1QLNPseudo<v8i16, truncstorei16, NEONvgetlaneu>;
 def VST1LNq32Pseudo : VST1QLNPseudo<v4i32, store, extractelt>;

 def : Pat<(store (extractelt (v2f32 DPR:$src), imm:$lane), addrmode6:$addr),
           (VST1LNd32 addrmode6:$addr, DPR:$src, imm:$lane)>;
 def : Pat<(store (extractelt (v4f32 QPR:$src), imm:$lane), addrmode6:$addr),
           (VST1LNq32Pseudo addrmode6:$addr, QPR:$src, imm:$lane)>;

+def : Pat<(store (extractelt (v4f16 DPR:$src), imm:$lane), addrmode6:$addr),
+          (VST1LNd16 addrmode6:$addr, DPR:$src, imm:$lane)>;
+def : Pat<(store (extractelt (v8f16 QPR:$src), imm:$lane), addrmode6:$addr),
+          (VST1LNq16Pseudo addrmode6:$addr, QPR:$src, imm:$lane)>;
+
 // ...with address register writeback:
 class VST1LNWB<bits<4> op11_8, bits<4> op7_4, string Dt, ValueType Ty,
                PatFrag StoreOp, SDNode ExtractOp, Operand AdrMode>
   : NLdStLn<1, 0b00, op11_8, op7_4, (outs GPR:$wb),
             (ins AdrMode:$Rn, am6offset:$Rm,
              DPR:$Vd, nohash_imm:$lane), IIC_VST1lnu, "vst1", Dt,
             "\\{$Vd[$lane]\\}, $Rn$Rm",
             "$Rn.addr = $wb",
[... 313 lines not shown ...]

 // Extract S sub-registers of Q/D registers.
 def SSubReg_f32_reg : SDNodeXForm<imm, [{
   assert(ARM::ssub_3 == ARM::ssub_0+3 && "Unexpected subreg numbering");
   return CurDAG->getTargetConstant(ARM::ssub_0 + N->getZExtValue(), SDLoc(N),
                                    MVT::i32);
 }]>;

+// Extract S sub-registers of Q/D registers containing a given f16 lane.
+def SSubReg_f16_reg : SDNodeXForm<imm, [{
+  assert(ARM::ssub_3 == ARM::ssub_0+3 && "Unexpected subreg numbering");
+  return CurDAG->getTargetConstant(ARM::ssub_0 + N->getZExtValue()/2, SDLoc(N),
+                                   MVT::i32);
+}]>;
+
 // Translate lane numbers from Q registers to D subregs.
 def SubReg_i8_lane : SDNodeXForm<imm, [{
   return CurDAG->getTargetConstant(N->getZExtValue() & 7, SDLoc(N), MVT::i32);
 }]>;
 def SubReg_i16_lane : SDNodeXForm<imm, [{
   return CurDAG->getTargetConstant(N->getZExtValue() & 3, SDLoc(N), MVT::i32);
 }]>;
 def SubReg_i32_lane : SDNodeXForm<imm, [{
[... 3,703 lines not shown ...]
 def : Pat<(extractelt (v4f32 QPR:$src1), imm:$src2),
           (EXTRACT_SUBREG (v4f32 (COPY_TO_REGCLASS (v4f32 QPR:$src1),QPR_VFP2)),
                           (SSubReg_f32_reg imm:$src2))>;
 //def : Pat<(extractelt (v2i64 QPR:$src1), imm:$src2),
 //          (EXTRACT_SUBREG QPR:$src1, (DSubReg_f64_reg imm:$src2))>;
 def : Pat<(extractelt (v2f64 QPR:$src1), imm:$src2),
           (EXTRACT_SUBREG QPR:$src1, (DSubReg_f64_reg imm:$src2))>;

+def imm_even : ImmLeaf<i32, [{ return (Imm & 1) == 0; }]>;
+def imm_odd : ImmLeaf<i32, [{ return (Imm & 1) == 1; }]>;
+
+def : Pat<(extractelt (v4f16 DPR:$src), imm_even:$lane),
+          (EXTRACT_SUBREG
+            (v2f32 (COPY_TO_REGCLASS (v4f16 DPR:$src), DPR_VFP2)),
+            (SSubReg_f16_reg imm_even:$lane))>;
+
+def : Pat<(extractelt (v4f16 DPR:$src), imm_odd:$lane),
+          (COPY_TO_REGCLASS
+            (VMOVH (EXTRACT_SUBREG
+                     (v2f32 (COPY_TO_REGCLASS (v4f16 DPR:$src), DPR_VFP2)),
+                     (SSubReg_f16_reg imm_odd:$lane))),
+            HPR)>;
+
+def : Pat<(extractelt (v8f16 QPR:$src), imm_even:$lane),
+          (EXTRACT_SUBREG
+            (v4f32 (COPY_TO_REGCLASS (v8f16 QPR:$src), QPR_VFP2)),
+            (SSubReg_f16_reg imm_even:$lane))>;
+
+def : Pat<(extractelt (v8f16 QPR:$src), imm_odd:$lane),
+          (COPY_TO_REGCLASS
+            (VMOVH (EXTRACT_SUBREG
+                     (v4f32 (COPY_TO_REGCLASS (v8f16 QPR:$src), QPR_VFP2)),
+                     (SSubReg_f16_reg imm_odd:$lane))),
+            HPR)>;
+
 // VMOV : Vector Set Lane (move ARM core register to scalar)

 let Constraints = "$src1 = $V" in {
 def VSETLNi8 : NVSetLane<{1,1,1,0,0,1,?,0}, 0b1011, {?,?}, (outs DPR:$V),
                          (ins DPR:$src1, GPR:$R, VectorIndex8:$lane),
                          IIC_VMOVISL, "vmov", "8", "$V$lane, $R",
                          [(set DPR:$V, (vector_insert (v8i8 DPR:$src1),
[... 42 lines not shown ...]

 def : Pat<(v2f32 (insertelt DPR:$src1, SPR:$src2, imm:$src3)),
           (INSERT_SUBREG (v2f32 (COPY_TO_REGCLASS DPR:$src1, DPR_VFP2)),
                          SPR:$src2, (SSubReg_f32_reg imm:$src3))>;
 def : Pat<(v4f32 (insertelt QPR:$src1, SPR:$src2, imm:$src3)),
           (INSERT_SUBREG (v4f32 (COPY_TO_REGCLASS QPR:$src1, QPR_VFP2)),
                          SPR:$src2, (SSubReg_f32_reg imm:$src3))>;

+def : Pat<(insertelt (v4f16 DPR:$src1), HPR:$src2, imm:$lane),
+          (v4f16 (VSETLNi16 DPR:$src1, (VMOVRH $src2), imm:$lane))>;
+def : Pat<(insertelt (v8f16 QPR:$src1), HPR:$src2, imm:$lane),
+          (v8f16 (INSERT_SUBREG QPR:$src1,
+                   (v4i16 (VSETLNi16 (v4i16 (EXTRACT_SUBREG QPR:$src1,
+                                              (DSubReg_i16_reg imm:$lane))),
+                                     (VMOVRH $src2), (SubReg_i16_lane imm:$lane))),
+                   (DSubReg_i16_reg imm:$lane)))>;
+
 //def : Pat<(v2i64 (insertelt QPR:$src1, DPR:$src2, imm:$src3)),
 //          (INSERT_SUBREG QPR:$src1, DPR:$src2, (DSubReg_f64_reg imm:$src3))>;
 def : Pat<(v2f64 (insertelt QPR:$src1, DPR:$src2, imm:$src3)),
           (INSERT_SUBREG QPR:$src1, DPR:$src2, (DSubReg_f64_reg imm:$src3))>;

 def : Pat<(v2f32 (scalar_to_vector SPR:$src)),
           (INSERT_SUBREG (v2f32 (IMPLICIT_DEF)), SPR:$src, ssub_0)>;
 def : Pat<(v2f64 (scalar_to_vector (f64 DPR:$src))),
[... 2,393 lines not shown ...]

llvm/trunk/test/CodeGen/ARM/fp16-insert-extract.ll

				; RUN: llc -mtriple=arm-eabi -mattr=+armv8.2-a,+fullfp16,+neon -float-abi=hard -O1 < %s | FileCheck %s
				; RUN: llc -mtriple=arm-eabi -mattr=+armv8.2-a,+fullfp16,+neon -float-abi=soft -O1 < %s | FileCheck %s

				define float @test_vget_lane_f16_1(<4 x half> %a) nounwind {
				; CHECK-LABEL: test_vget_lane_f16_1:
				; CHECK: vmovx.f16 s0, s0
				; CHECK-NEXT: vcvtb.f32.f16 s0, s0
				entry:
				%elt = extractelement <4 x half> %a, i32 1
				%conv = fpext half %elt to float
				ret float %conv
				}

				define float @test_vget_lane_f16_2(<4 x half> %a) nounwind {
				; CHECK-LABEL: test_vget_lane_f16_2:
				; CHECK-NOT: vmovx.f16
				; CHECK: vcvtb.f32.f16 s0, s1
				entry:
				%elt = extractelement <4 x half> %a, i32 2
				%conv = fpext half %elt to float
				ret float %conv
				}

				define float @test_vget_laneq_f16_6(<8 x half> %a) nounwind {
				; CHECK-LABEL: test_vget_laneq_f16_6:
				; CHECK-NOT: vmovx.f16
				; CHECK: vcvtb.f32.f16 s0, s3
				entry:
				%elt = extractelement <8 x half> %a, i32 6
				%conv = fpext half %elt to float
				ret float %conv
				}

				define float @test_vget_laneq_f16_7(<8 x half> %a) nounwind {
				; CHECK-LABEL: test_vget_laneq_f16_7:
				; CHECK: vmovx.f16 s0, s3
				; CHECK: vcvtb.f32.f16 s0, s0
				entry:
				%elt = extractelement <8 x half> %a, i32 7
				%conv = fpext half %elt to float
				ret float %conv
				}

				define <4 x half> @test_vset_lane_f16(<4 x half> %a, float %fb) nounwind {
				; CHECK-LABEL: test_vset_lane_f16:
				; CHECK: vmov.f16 r[[GPR:[0-9]+]], s{{[0-9]+}}
				; CHECK: vmov.16 d{{[0-9]+}}[3], r[[GPR]]
				entry:
				%b = fptrunc float %fb to half
				%x = insertelement <4 x half> %a, half %b, i32 3
				ret <4 x half> %x
				}

				define <8 x half> @test_vset_laneq_f16_1(<8 x half> %a, float %fb) nounwind {
				; CHECK-LABEL: test_vset_laneq_f16_1:
				; CHECK: vmov.f16 r[[GPR:[0-9]+]], s{{[0-9]+}}
				; CHECK: vmov.16 d{{[0-9]+}}[1], r[[GPR]]
				entry:
				%b = fptrunc float %fb to half
				%x = insertelement <8 x half> %a, half %b, i32 1
				ret <8 x half> %x
				}

				define <8 x half> @test_vset_laneq_f16_7(<8 x half> %a, float %fb) nounwind {
				; CHECK-LABEL: test_vset_laneq_f16_7:
				; CHECK: vmov.f16 r[[GPR:[0-9]+]], s{{[0-9]+}}
				; CHECK: vmov.16 d{{[0-9]+}}[3], r[[GPR]]
				entry:
				%b = fptrunc float %fb to half
				%x = insertelement <8 x half> %a, half %b, i32 7
				ret <8 x half> %x
				}

llvm/trunk/test/CodeGen/ARM/fp16-vldlane-vstlane.ll

				; RUN: llc -mtriple=arm-eabi -mattr=+armv8.2-a,+fullfp16,+neon -float-abi=hard -O1 < %s | FileCheck %s
				; RUN: llc -mtriple=arm-eabi -mattr=+armv8.2-a,+fullfp16,+neon -float-abi=soft -O1 < %s | FileCheck %s

				define <4 x half> @vld1d_lane_f16(half* %pa, <4 x half> %v4) nounwind {
				; CHECK-LABEL: vld1d_lane_f16:
				; CHECK: vld1.16 {d{{[0-9]+}}[3]}, [r0:16]
				entry:
				%a = load half, half* %pa
				%res = insertelement <4 x half> %v4, half %a, i32 3
				ret <4 x half> %res
				}

				define <8 x half> @vld1q_lane_f16_1(half* %pa, <8 x half> %v8) nounwind {
				; CHECK-LABEL: vld1q_lane_f16_1:
				; CHECK: vld1.16 {d{{[0-9]+}}[1]}, [r0:16]
				entry:
				%a = load half, half* %pa
				%res = insertelement <8 x half> %v8, half %a, i32 1
				ret <8 x half> %res
				}

				define <8 x half> @vld1q_lane_f16_7(half* %pa, <8 x half> %v8) nounwind {
				; CHECK-LABEL: vld1q_lane_f16_7:
				; CHECK: vld1.16 {d{{[0-9]+}}[3]}, [r0:16]
				entry:
				%a = load half, half* %pa
				%res = insertelement <8 x half> %v8, half %a, i32 7
				ret <8 x half> %res
				}

				define void @vst1d_lane_f16(half* %pa, <4 x half> %v4) nounwind {
				; CHECK-LABEL: vst1d_lane_f16:
				; CHECK: vst1.16 {d{{[0-9]+}}[3]}, [r0:16]
				entry:
				%a = extractelement <4 x half> %v4, i32 3
				store half %a, half* %pa
				ret void
				}

				define void @vst1q_lane_f16_7(half* %pa, <8 x half> %v8) nounwind {
				; CHECK-LABEL: vst1q_lane_f16_7:
				; CHECK: vst1.16 {d{{[0-9]+}}[3]}, [r0:16]
				entry:
				%a = extractelement <8 x half> %v8, i32 7
				store half %a, half* %pa
				ret void
				}

				define void @vst1q_lane_f16_1(half* %pa, <8 x half> %v8) nounwind {
				; CHECK-LABEL: vst1q_lane_f16_1:
				; CHECK: vst1.16 {d{{[0-9]+}}[1]}, [r0:16]
				entry:
				%a = extractelement <8 x half> %v8, i32 1
				store half %a, half* %pa
				ret void
				}