
Details

Reviewers

efriedma
paquette
dmgreen
SjoerdMeijer

Commits

rG772aaa602383: [AArch64] Improve lowering of insert_vector_elt with 0.0 consts.

Summary

When moving 0.0 into a float vector, we can use the vi*gpr variants of
INS. I am not sure if we can easily express this in the tablegen
descriptions, because INS*vi*gpr is only defined for integer vectors and
I am not sure how to convert things there.

This patch extends LowerINSERT_VECTOR_ELT to bitcast the input float
vector to an integer vector, apply the insertion and bitcast the result
back.

This way, we can piggy-back on the matching for the integer variants.
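The bitcast round trip described in this summary can be modelled outside LLVM. The following is only a sketch under stated assumptions: `V4F32`, `V4I32` and `insertZeroViaIntegerView` are hypothetical names for illustration, not LLVM types or functions, and `std::memcpy` stands in for the bitcast nodes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for v4f32 and v4i32; these are not LLVM types.
using V4F32 = std::array<float, 4>;
using V4I32 = std::array<std::uint32_t, 4>;

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
static_assert(sizeof(V4F32) == sizeof(V4I32), "lane layouts must match");

// Insert +0.0 into lane `idx` by going through the integer view of the vector.
V4F32 insertZeroViaIntegerView(const V4F32 &vec, std::size_t idx) {
  assert(idx < vec.size() && "lane index out of range");

  V4I32 bits;
  std::memcpy(bits.data(), vec.data(), sizeof(bits)); // bitcast v4f32 -> v4i32
  bits[idx] = 0;                                      // integer insert of 0
  V4F32 out;
  std::memcpy(out.data(), bits.data(), sizeof(out));  // bitcast v4i32 -> v4f32
  return out;
}
```

Calling `insertZeroViaIntegerView({1.0f, 2.0f, 3.0f, 4.0f}, 3)` leaves lanes 0 to 2 untouched and sets lane 3 to +0.0, which is the effect the integer INS variant with a zero register is relied on to produce.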

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision. Oct 26 2020, 11:02 AM

Herald added a project: Restricted Project. Oct 26 2020, 11:02 AM

Herald added subscribers: danielkiss, hiraditya, kristof.beyls.

fhahn requested review of this revision. Oct 26 2020, 11:02 AM

Harbormaster completed remote builds in B76442: Diff 300742. Oct 26 2020, 11:41 AM

fhahn mentioned this in D90233: [AArch64] Use DUP for BUILD_VECTOR with few different elements. Oct 27 2020, 8:05 AM

We could turn this into a more general combine. We use fmov from a GPR to materialize fp constants in other cases. But maybe just zero is fine to start.

I am not sure if we can easily express this in the tablegen descriptions, because INS*vi*gpr is only defined for integer vectors and I am not sure how to convert things there.

MachineInstrs don't distinguish between integer and float vectors; there's only V64/V128 register classes. So the types only matter for the inputs to tablegen patterns, not the outputs. You can write a pattern that takes a float vector as an input and produces an INSvi32gpr without any casting.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9257 ↗	(On Diff #300742)	I think isZero is true for -0.0?

Updated to make sure we do not do this transform for -0.0, thanks!
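The reason -0.0 has to be excluded is visible in the bit patterns: only +0.0 is all-zero bits, so only +0.0 can be produced by writing a zero register into the lane. A small sketch, assuming IEEE-754 binary32 `float`; `floatBits` and `isPositiveZero` are hypothetical helper names for illustration, not LLVM APIs.

```cpp
#include <cstdint>
#include <cstring>

// Return the raw bits of f (assumes float is IEEE-754 binary32).
std::uint32_t floatBits(float f) {
  static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return bits;
}

// True only for positive zero, the one value an all-zero register can stand
// in for. A comparison such as `f == 0.0f` would also accept -0.0.
bool isPositiveZero(float f) { return floatBits(f) == 0; }
```

Here `floatBits(-0.0f)` has the sign bit set (0x80000000), so a check that treats both zeros alike would let the transform turn -0.0 into +0.0.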

In D90176#2357309, @efriedma wrote:

We could turn this into a more general combine. We use fmov from a GPR to materialize fp constants in other cases. But maybe just zero is fine to start.

So for other FP constants that have values that can be cheaply materialized into GPRs? That would be good. I guess that would be helpful in all contexts where we can use GPRs directly instead of FPRs. Is there a way to match something like that generically?

Happy to look into that, but would prefer to start just with the 0.0 case.

I am not sure if we can easily express this in the tablegen descriptions, because INS*vi*gpr is only defined for integer vectors and I am not sure how to convert things there.

MachineInstrs don't distinguish between integer and float vectors; there's only V64/V128 register classes. So the types only matter for the inputs to tablegen patterns, not the outputs. You can write a pattern that takes a float vector as an input and produces an INSvi32gpr without any casting.

Thanks! I think I was running into some ambiguity with the patterns, not sure exactly what was going on. Would you prefer the tablegen version?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9257 ↗	(On Diff #300742)	Indeed, thanks for spotting this!

In D90176#2357415, @fhahn wrote:

In D90176#2357309, @efriedma wrote:

We could turn this into a more general combine. We use fmov from a GPR to materialize fp constants in other cases. But maybe just zero is fine to start.

So for other FP constants that have values that can be cheaply materialized into GPRs? That would be good. I guess that would be helpful in all contexts where we can use GPRs directly instead of FPRs. Is there a way to match something like that generically?

For values that aren't literals, there would be a bitcast or something like that.

For literals, the rule for whether we use a GPR is complicated; see AArch64TargetLowering::isFPImmLegal. But once we get to isel, the actual lowering is pretty simple; see bitcast_fpimm_to_i32.
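As a rough illustration of what materializing an FP literal through a GPR amounts to: the constant's bit pattern is built as an integer and then reinterpreted as a float. This is a sketch only; `fpImmAsI32` and `fmovFromGpr` are hypothetical names chosen for this example (judging by its name, `bitcast_fpimm_to_i32` performs the first step at isel time), and the legality rules in `isFPImmLegal` are not modelled.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");

// Integer a GPR would have to hold so that moving it into an FP register
// yields `imm` (assumes IEEE-754 binary32).
std::uint32_t fpImmAsI32(float imm) {
  std::uint32_t bits;
  std::memcpy(&bits, &imm, sizeof(bits));
  return bits;
}

// Model of a GPR-to-FPR move: reinterpret the integer bits as a float.
float fmovFromGpr(std::uint32_t gpr) {
  float value;
  std::memcpy(&value, &gpr, sizeof(value));
  return value;
}
```

For example, `fpImmAsI32(1.0f)` is 0x3F800000, and `fmovFromGpr` applied to that integer gives back 1.0f. Whether building such an integer is cheaper than a constant-pool load depends on the value, which is the part the legality check decides.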

Happy to look into that, but would prefer to start just with the 0.0 case.

I am not sure if we can easily express this in the tablegen descriptions, because INS*vi*gpr is only defined for integer vectors and I am not sure how to convert things there.

MachineInstrs don't distinguish between integer and float vectors; there's only V64/V128 register classes. So the types only matter for the inputs to tablegen patterns, not the outputs. You can write a pattern that takes a float vector as an input and produces an INSvi32gpr without any casting.

Thanks! I think I was running into some ambiguity with the patterns, not sure exactly what was going on. Would you prefer the tablegen version?

I'm a little worried this version will make other combines more complicated; I'd prefer the TableGen version if it isn't too hard.

Harbormaster completed remote builds in B76622: Diff 301087. Oct 27 2020, 2:11 PM

In D90176#2357487, @efriedma wrote:

In D90176#2357415, @fhahn wrote:

Thanks! I think I was running into some ambiguity with the patterns, not sure exactly what was going on. Would you prefer the tablegen version?

I'm a little worried this version will make other combines more complicated; I'd prefer the TableGen version if it isn't too hard.

Thanks, I figured out what was going wrong with the tablegen version initially. The tablegen patterns now work for f32 and f64, but for some reason the matching does not work for f16.

Could this be related to fpimm0 not matching 0.0 for f16? It prints ConstantFP:f16<APFloat(0)> instead of ConstantFP:f32<0.00000> for f32 constants.

Harbormaster completed remote builds in B76700: Diff 301225. Oct 28 2020, 4:31 AM

but for some reason the matching does not work for f16

I think isFPImmLegal is false, so we're forcing a constant pool. I think you can make the testcase work with +fullfp16?

fhahn mentioned this in rGba78cae20f14: [AArch64] Use DUP for BUILD_VECTOR with few different elements. Oct 28 2020, 12:52 PM

In D90176#2359750, @efriedma wrote:

but for some reason the matching does not work for f16

I think isFPImmLegal is false, so we're forcing a constant pool. I think you can make the testcase work with +fullfp16?

Yeah, that was the issue. Adding +fullfp16 solved the issue. Updated the tests. It's probably not worth trying to improve f16 codegen without +fullfp16.

LGTM

This revision is now accepted and ready to land. Oct 28 2020, 2:05 PM

This revision was landed with ongoing or failed builds. Oct 28 2020, 2:35 PM

Closed by commit rG772aaa602383: [AArch64] Improve lowering of insert_vector_elt with 0.0 consts. (authored by fhahn).

This revision was automatically updated to reflect the committed changes.

fhahn added a commit: rG772aaa602383: [AArch64] Improve lowering of insert_vector_elt with 0.0 consts..

Thanks for the review!

Harbormaster completed remote builds in B76798: Diff 301401. Oct 28 2020, 4:13 PM

Diff 301439

llvm/lib/Target/AArch64/AArch64InstrInfo.td

[5,171 lines hidden] def : Pat<(v4f16 (vector_insert (v4f16 V64:$Rn),
 (EXTRACT_SUBREG
 (INSvi16lane
 (v8f16 (INSERT_SUBREG (v8f16 (IMPLICIT_DEF)), V64:$Rn, dsub)),
 VectorIndexS:$imm,
 (v8f16 (INSERT_SUBREG (v8f16 (IMPLICIT_DEF)), FPR16:$Rm, hsub)),
 (i64 0)),
 dsub)>;

+def : Pat<(vector_insert (v8f16 v8f16:$Rn), (f16 fpimm0),
+            (i64 VectorIndexH:$imm)),
+          (INSvi16gpr V128:$Rn, VectorIndexH:$imm, WZR)>;
+def : Pat<(vector_insert v4f32:$Rn, (f32 fpimm0),
+            (i64 VectorIndexS:$imm)),
+          (INSvi32gpr V128:$Rn, VectorIndexS:$imm, WZR)>;
+def : Pat<(vector_insert v2f64:$Rn, (f64 fpimm0),
+            (i64 VectorIndexD:$imm)),
+          (INSvi64gpr V128:$Rn, VectorIndexS:$imm, XZR)>;
+
 def : Pat<(v8f16 (vector_insert (v8f16 V128:$Rn),
 (f16 FPR16:$Rm), (i64 VectorIndexH:$imm))),
 (INSvi16lane
 V128:$Rn, VectorIndexH:$imm,
 (v8f16 (INSERT_SUBREG (v8f16 (IMPLICIT_DEF)), FPR16:$Rm, hsub)),
 (i64 0))>;

 def : Pat<(v4bf16 (vector_insert (v4bf16 V64:$Rn),
[2,508 lines hidden]

llvm/test/CodeGen/AArch64/arm64-vector-insertion.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mtriple=arm64-eabi -mcpu=generic -aarch64-neon-syntax=apple | FileCheck %s
+; RUN: llc < %s -mtriple=arm64-eabi -mcpu=generic -aarch64-neon-syntax=apple -mattr="+fullfp16" | FileCheck %s

 define void @test0f(float* nocapture %x, float %a) #0 {
 ; CHECK-LABEL: test0f:
 ; CHECK: // %bb.0: // %entry
 ; CHECK-NEXT: movi.2d v1, #0000000000000000
 ; CHECK-NEXT: // kill: def $s0 killed $s0 def $q0
 ; CHECK-NEXT: mov.s v1[0], v0[0]
 ; CHECK-NEXT: str q1, [x0]
[93 lines hidden] ; CHECK-NEXT: ret
 %v.15 = insertelement <16 x i8> %v.14, i8 %b, i32 15
 ret <16 x i8> %v.15
 }

 define <8 x half> @test_insert_v8f16_insert_1(half %a) {
 ; CHECK-LABEL: test_insert_v8f16_insert_1:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: // kill: def $h0 killed $h0 def $q0
-; CHECK-NEXT: adrp x8, .LCPI6_0
 ; CHECK-NEXT: dup.8h v0, v0[0]
-; CHECK-NEXT: add x8, x8, :lo12:.LCPI6_0
-; CHECK-NEXT: ld1.h { v0 }[7], [x8]
+; CHECK-NEXT: mov.h v0[7], wzr
 ; CHECK-NEXT: ret
 %v.0 = insertelement <8 x half> <half undef, half undef, half undef, half undef, half undef, half undef, half undef, half 0.0>, half %a, i32 0
 %v.1 = insertelement <8 x half> %v.0, half %a, i32 1
 %v.2 = insertelement <8 x half> %v.1, half %a, i32 2
 %v.3 = insertelement <8 x half> %v.2, half %a, i32 3
 %v.4 = insertelement <8 x half> %v.3, half %a, i32 4
 %v.5 = insertelement <8 x half> %v.4, half %a, i32 5
 %v.6 = insertelement <8 x half> %v.5, half %a, i32 6
[88 lines hidden] ; CHECK-NEXT: ret
 %v.0 = insertelement <2 x float> <float 0.000000e+00, float undef>, float %a, i32 1
 ret <2 x float> %v.0
 }

 define <4 x float> @test_insert_3_f32_undef_zero_vector(float %a) {
 ; CHECK-LABEL: test_insert_3_f32_undef_zero_vector:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: // kill: def $s0 killed $s0 def $q0
-; CHECK-NEXT: fmov s1, wzr
 ; CHECK-NEXT: dup.4s v0, v0[0]
-; CHECK-NEXT: mov.s v0[3], v1[0]
+; CHECK-NEXT: mov.s v0[3], wzr
 ; CHECK-NEXT: ret
 %v.0 = insertelement <4 x float> <float undef, float undef, float undef, float 0.000000e+00>, float %a, i32 0
 %v.1 = insertelement <4 x float> %v.0, float %a, i32 1
 %v.2 = insertelement <4 x float> %v.1, float %a, i32 2
 ret <4 x float> %v.2
 }

 define <4 x float> @test_insert_3_f32_undef(float %a) {
[41 lines hidden]
 ; CHECK-NEXT: mov.s v1[0], v0[0]
 ; CHECK-NEXT: mov.s v1[2], v0[0]
 ; CHECK-NEXT: mov.16b v0, v1
 ; CHECK-NEXT: ret
 %v.0 = insertelement <4 x float> %b, float %a, i32 0
 %v.1 = insertelement <4 x float> %v.0, float %a, i32 2
 ret <4 x float> %v.1
 }

 define <8 x i16> @test_insert_v8i16_i16_zero(<8 x i16> %a) {
 ; CHECK-LABEL: test_insert_v8i16_i16_zero:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: mov.h v0[5], wzr
 ; CHECK-NEXT: ret
 %v.0 = insertelement <8 x i16> %a, i16 0, i32 5
 ret <8 x i16> %v.0
 }

 ; TODO: This should jsut be a mov.s v0[3], wzr
 define <4 x half> @test_insert_v4f16_f16_zero(<4 x half> %a) {
 ; CHECK-LABEL: test_insert_v4f16_f16_zero:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: adrp x8, .LCPI19_0
 ; CHECK-NEXT: // kill: def $d0 killed $d0 def $q0
-; CHECK-NEXT: add x8, x8, :lo12:.LCPI19_0
-; CHECK-NEXT: ld1.h { v0 }[0], [x8]
+; CHECK-NEXT: mov.h v0[0], wzr
 ; CHECK-NEXT: // kill: def $d0 killed $d0 killed $q0
 ; CHECK-NEXT: ret
 %v.0 = insertelement <4 x half> %a, half 0.000000e+00, i32 0
 ret <4 x half> %v.0
 }

 define <8 x half> @test_insert_v8f16_f16_zero(<8 x half> %a) {
 ; CHECK-LABEL: test_insert_v8f16_f16_zero:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: adrp x8, .LCPI20_0
-; CHECK-NEXT: add x8, x8, :lo12:.LCPI20_0
-; CHECK-NEXT: ld1.h { v0 }[6], [x8]
+; CHECK-NEXT: mov.h v0[6], wzr
 ; CHECK-NEXT: ret
 %v.0 = insertelement <8 x half> %a, half 0.000000e+00, i32 6
 ret <8 x half> %v.0
 }

 define <2 x float> @test_insert_v2f32_f32_zero(<2 x float> %a) {
 ; CHECK-LABEL: test_insert_v2f32_f32_zero:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: // kill: def $d0 killed $d0 def $q0
-; CHECK-NEXT: fmov s1, wzr
-; CHECK-NEXT: mov.s v0[0], v1[0]
+; CHECK-NEXT: mov.s v0[0], wzr
 ; CHECK-NEXT: // kill: def $d0 killed $d0 killed $q0
 ; CHECK-NEXT: ret
 %v.0 = insertelement <2 x float> %a, float 0.000000e+00, i32 0
 ret <2 x float> %v.0
 }

 define <4 x float> @test_insert_v4f32_f32_zero(<4 x float> %a) {
 ; CHECK-LABEL: test_insert_v4f32_f32_zero:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: fmov s1, wzr
-; CHECK-NEXT: mov.s v0[3], v1[0]
+; CHECK-NEXT: mov.s v0[3], wzr
 ; CHECK-NEXT: ret
 %v.0 = insertelement <4 x float> %a, float 0.000000e+00, i32 3
 ret <4 x float> %v.0
 }

 define <2 x double> @test_insert_v2f64_f64_zero(<2 x double> %a) {
 ; CHECK-LABEL: test_insert_v2f64_f64_zero:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: fmov d1, xzr
-; CHECK-NEXT: mov.d v0[1], v1[0]
+; CHECK-NEXT: mov.d v0[1], xzr
 ; CHECK-NEXT: ret
 %v.0 = insertelement <2 x double> %a, double 0.000000e+00, i32 1
 ret <2 x double> %v.0
 }

llvm/test/CodeGen/AArch64/vecreduce-fadd-legalization.ll

[41 lines hidden]
 ; CHECK-NEXT: ret
 %b = call fast nnan fp128 @llvm.vector.reduce.fadd.f128.v1f128(fp128 zeroinitializer, <1 x fp128> %a)
 ret fp128 %b
 }

 define float @test_v3f32(<3 x float> %a) nounwind {
 ; CHECK-LABEL: test_v3f32:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: fmov s1, wzr
-; CHECK-NEXT: mov v0.s[3], v1.s[0]
+; CHECK-NEXT: mov v0.s[3], wzr
 ; CHECK-NEXT: ext v1.16b, v0.16b, v0.16b, #8
 ; CHECK-NEXT: fadd v0.2s, v0.2s, v1.2s
 ; CHECK-NEXT: faddp s0, v0.2s
 ; CHECK-NEXT: ret
 %b = call fast nnan float @llvm.vector.reduce.fadd.f32.v3f32(float 0.0, <3 x float> %a)
 ret float %b
 }

[24 lines hidden]

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Improve lowering of insert_vector_elt with 0.0 consts.
ClosedPublic
