This is an archive of the discontinued LLVM Phabricator instance.

Differential D51925

[AMDGPU] Fix issue for zext of f16 to i32
Needs Review · Public

Authored by dstuttard on Sep 11 2018, 4:47 AM.

Details

Reviewers

tpr
arsenm

Summary

Vulkan exposed an issue with this for a case with v_mad_mixlo_f16 where the
upper 16 bits were not cleared.

Modifying this to clear the bits instead of just copying fixed the problem.

V2: Fixed up "Fix issue for zext of f16 to i32"
V3: Fixed fcanonicalize-elimination test
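
To make the failure mode in the summary concrete, here is a minimal sketch of a 32-bit register whose low half is written by an f16-producing instruction. All function names are hypothetical and only model the behaviour described above; this is not code from the patch.

```cpp
#include <cstdint>

// Toy model of a 32-bit VGPR. All names here are hypothetical.

// Models an instruction that writes an f16 result into bits [15:0] and
// leaves bits [31:16] of the destination untouched (the behaviour the
// summary attributes to v_mad_mixlo_f16).
constexpr uint32_t writeLow16PreserveHigh(uint32_t oldDst, uint16_t f16Bits) {
  return (oldDst & 0xffff0000u) | f16Bits;
}

// zext lowered as a plain register copy: only correct when the upper
// 16 bits are already zero.
constexpr uint32_t zextByCopy(uint32_t reg) { return reg; }

// zext lowered as an AND with 0xffff: correct whatever the upper bits hold.
constexpr uint32_t zextByMask(uint32_t reg) { return reg & 0x0000ffffu; }
```

If the destination held stale data in its upper half before the write, `zextByCopy` returns that stale data along with the f16 payload, while `zextByMask` returns only the payload.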

Diff Detail

Repository

rL LLVM

Build Status

Buildable 22471
Build 22471: arc lint + arc unit

Event Timeline

dstuttard created this revision. Sep 11 2018, 4:47 AM

Herald added subscribers: llvm-commits, t-tye, tpr and 6 others. Sep 11 2018, 4:47 AM

Harbormaster completed remote builds in B22471: Diff 164846. Sep 11 2018, 4:47 AM

dstuttard added reviewers: arsenm, tpr. Sep 11 2018, 4:50 AM

arsenm added inline comments. Sep 11 2018, 5:26 AM

lib/Target/AMDGPU/SIInstructions.td
Line 1353: IIRC this node is only supposed to be emitted if the high bits are known zero, so something is wrong upstream if it’s gotten here

dstuttard added inline comments. Sep 11 2018, 6:13 AM

lib/Target/AMDGPU/SIInstructions.td
Line 1353: Ok - thanks. I'll take another look.

Looking again at the code - you're correct that it attempts to only do this transformation if the high bits are zero.
However, the code that checks this has the following telling comment:

// (i32 zext (i16 (bitcast f16:$src))) -> fp16_zext $src
// FIXME: It is not universally true that the high bits are zeroed on gfx9.
if (Src.getOpcode() == ISD::BITCAST) {
  SDValue BCSrc = Src.getOperand(0);
  if (BCSrc.getValueType() == MVT::f16 &&
      fp16SrcZerosHighBits(BCSrc.getOpcode()))
    return DCI.DAG.getNode(AMDGPUISD::FP16_ZEXT, SDLoc(N), VT, BCSrc);
}

In this particular case the BCSrc operation was an fptrunc which passes the fp16SrcZerosHighBits test - but that eventually ends up as v_mad_mixlo_f16 which doesn't ensure that the high bits are zero.

Any suggestions on how to proceed? I agree that it seems a shame to have to insert the extra AND operation blindly.
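
The mismatch described above can be sketched as a toy model: the combine asks an opcode-level question before instruction selection, while the real guarantee depends on the instruction that is eventually selected. The enums and functions below are hypothetical stand-ins, not LLVM APIs, and treating `V_CVT_F16_F32` as zeroing the high bits is an assumption made purely for illustration.

```cpp
// Hypothetical, simplified stand-ins for DAG opcodes and selected instructions.
enum class DagOp { FpTrunc, Other };
enum class MachineInst { V_CVT_F16_F32, V_MAD_MIXLO_F16 };

// Opcode-only check in the spirit of fp16SrcZerosHighBits: it is answered at
// combine time, before it is known which instruction will be selected.
constexpr bool opcodeAssumedToZeroHighBits(DagOp op) {
  return op == DagOp::FpTrunc;
}

// Whether the selected instruction leaves bits [31:16] zero in this toy model.
// Assumption: V_CVT_F16_F32 zeroes them; the thread only states that
// v_mad_mixlo_f16 does not guarantee it.
constexpr bool instZeroesHighBits(MachineInst inst) {
  return inst == MachineInst::V_CVT_F16_F32;
}

// Replacing the zext with a plain copy is sound only if the opcode-level
// answer is never more optimistic than what the selected instruction does.
constexpr bool combineIsSound(DagOp op, MachineInst selected) {
  return !opcodeAssumedToZeroHighBits(op) || instZeroesHighBits(selected);
}
```

In this model `combineIsSound(DagOp::FpTrunc, MachineInst::V_MAD_MIXLO_F16)` is false, which is the case reported in the review.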

In D51925#1230429, @dstuttard wrote:
Looking again at the code - you're correct that it attempts to only do this transformation if the high bits are zero.
However, the code that checks this has the following telling comment:
// (i32 zext (i16 (bitcast f16:$src))) -> fp16_zext $src
// FIXME: It is not universally true that the high bits are zeroed on gfx9.
if (Src.getOpcode() == ISD::BITCAST) {
  SDValue BCSrc = Src.getOperand(0);
  if (BCSrc.getValueType() == MVT::f16 &&
      fp16SrcZerosHighBits(BCSrc.getOpcode()))
    return DCI.DAG.getNode(AMDGPUISD::FP16_ZEXT, SDLoc(N), VT, BCSrc);
}
In this particular case the BCSrc operation was an fptrunc which passes the fp16SrcZerosHighBits test - but that eventually ends up as v_mad_mixlo_f16 which doesn't ensure that the high bits are zero.

Any suggestions on how to proceed? I agree that it seems a shame to have to insert the extra AND operation blindly.

I guess you could check the subtarget in fp16SrcZerosHighBits. However that's pretty risky since it's depending on things we can't guarantee. Something could transform any other instruction into something else that won't preserve this. Overall I'm very unhappy this hardware change happened and it's a lot of work to handle all of this properly. I think what we really need is to drop this combine/node, and a separate machine instruction for every operation that preserves the high bits (with a tied source operand) vs. zeros them, and then have a machine pass that tries to clean up the extra ands while dropping this combine. We'll have to do extra work because we will have missed out on combines that this was enabling.
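
The longer-term direction proposed here (always emit the mask, distinguish zeroing from preserving instructions, then delete masks that are provably redundant) might look roughly like the following toy pass. The instruction kinds, the `Inst` struct and `dropRedundantMasks` are all hypothetical; a real machine pass would work on MachineInstrs and handle control flow, which this straight-line sketch does not.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy IR: every instruction writes exactly one register.
enum class Kind {
  ZeroingF16Op,     // writes bits [15:0], zeroes bits [31:16]
  PreservingF16Op,  // writes bits [15:0], keeps bits [31:16] as they were
  AndLow16          // reg = reg & 0xffff, in place
};

struct Inst {
  Kind kind;
  unsigned reg;
};

// Returns a copy of `prog` without the AndLow16 instructions whose register
// is already known to have zero high bits at that point.
std::vector<Inst> dropRedundantMasks(const std::vector<Inst>& prog) {
  std::vector<Inst> out;
  std::vector<bool> highKnownZero;  // indexed by register number
  for (const Inst& inst : prog) {
    if (inst.reg >= highKnownZero.size())
      highKnownZero.resize(static_cast<std::size_t>(inst.reg) + 1, false);
    switch (inst.kind) {
    case Kind::ZeroingF16Op:
      highKnownZero[inst.reg] = true;
      out.push_back(inst);
      break;
    case Kind::PreservingF16Op:
      // High bits are unchanged, so the existing knowledge stays valid.
      out.push_back(inst);
      break;
    case Kind::AndLow16:
      if (!highKnownZero[inst.reg]) {
        highKnownZero[inst.reg] = true;
        out.push_back(inst);
      }
      break;
    }
  }
  return out;
}
```

The point of the design is that correctness no longer depends on a guess made before instruction selection: the mask is always present initially, and it is removed only when the instruction actually selected justifies it.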

In D51925#1233093, @arsenm wrote:
I guess you could check the subtarget in fp16SrcZerosHighBits. However that's pretty risky since it's depending on things we can't guarantee. Something could transform any other instruction into something else that won't preserve this. Overall I'm very unhappy this hardware change happened and it's a lot of work to handle all of this properly. I think what we really need is to drop this combine/node, and a separate machine instruction for every operation that preserves the high bits (with a tied source operand) vs. zeros them, and then have a machine pass that tries to clean up the extra ands while dropping this combine. We'll have to do extra work because we will have missed out on combines that this was enabling.

OK - given that something like that is a larger change, how about we commit this (with an appropriate comment) for now and work on something better in the long term?

@arsenm Matt, any more comments? Would you be happy with a clarification comment as per the last suggestion from me?

ping

In D51925#1309692, @dstuttard wrote:

ping

What happens if you just drop the optimization entirely?

In D51925#1309875, @arsenm wrote:

What happens if you just drop the optimization entirely?

Not sure what you mean - I get a load of lit test failures (31), but that's what I'd expect since it no longer does the transform to use FP16_ZEXT - unless you mean something else?

ping

I think 9ad8a1f6fb2aea775736cd59129b7299be443c5c fixed this problem

Herald added a project: Restricted Project. Aug 26 2021, 6:04 PM

Herald added a subscriber: kerbowa.

This can be abandoned

foad mentioned this in rGa2453c613085: [AMDGPU] Add test case for zext of f16 to i32. Jul 17 2023, 4:58 AM

Revision Contents

Path                                                Size
lib/Target/AMDGPU/SIInstructions.td                 7 lines
test/CodeGen/AMDGPU/fcanonicalize-elimination.ll    4 lines
test/CodeGen/AMDGPU/fptrunc.f16.ll                  6 lines
test/CodeGen/AMDGPU/mad-mix-lo.ll                   16 lines

Diff 164846

lib/Target/AMDGPU/SIInstructions.td

 def : GCNPat <
 (f64 (uint_to_fp i1:$src)),
 (V_CVT_F64_U32_e32 (V_CNDMASK_B32_e64 (i32 0), (i32 1), $src))
 >;

 //===----------------------------------------------------------------------===//
 // Miscellaneous Patterns
 //===----------------------------------------------------------------------===//
+let OtherPredicates = [ Predicate<"Subtarget->getGeneration() < AMDGPUSubtarget::GFX9"> ] in {
 def : GCNPat <
 (i32 (AMDGPUfp16_zext f16:$src)),
 (COPY $src)
 >;
+}
+def : GCNPat <
+(i32 (AMDGPUfp16_zext f16:$src)),
+(V_AND_B32_e64 $src, (V_MOV_B32_e32 (i32 0x0000ffff)))
+>;

 def : GCNPat <
 (i32 (trunc i64:$a)),
 (EXTRACT_SUBREG $a, sub0)
 >;

 def : GCNPat <
 (i1 (trunc i32:$a)),
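
As a rough reading of the two patterns in this hunk, the predicated pattern keeps the plain copy for generations before GFX9, and the new pattern masks the value otherwise. The sketch below models that choice on a 32-bit value; the `Generation` enum and `lowerFp16Zext` are hypothetical names, and the sketch does not model how TableGen orders patterns that both match.

```cpp
#include <cstdint>

// Hypothetical ordering of hardware generations, oldest first.
enum class Generation { SouthernIslands, SeaIslands, VolcanicIslands, GFX9 };

// Result of lowering fp16_zext on a 32-bit register value: a copy before
// GFX9, an AND with 0xffff from GFX9 onwards.
constexpr uint32_t lowerFp16Zext(Generation gen, uint32_t src) {
  return gen < Generation::GFX9 ? src : (src & 0x0000ffffu);
}
```

This corresponds to the updated fptrunc.f16.ll checks, where the SIVI prefix expects no extra instruction and the GFX9 prefix expects a `v_and_b32_e32` with `0xffff`.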

test/CodeGen/AMDGPU/fcanonicalize-elimination.ll

 define half @v_test_canonicalize_extract_element_v2f16(<2 x half> %vec) {
 %vec.op = fmul <2 x half> %vec, <half 4.0, half 4.0>
 %elt = extractelement <2 x half> %vec.op, i32 0
 %canonicalized = call half @llvm.canonicalize.f16(half %elt)
 ret half %canonicalized
 }

 ; GCN-LABEL: {{^}}v_test_canonicalize_insertelement_v2f16:
-; GFX9: v_pk_mul_f16
-; GFX9: v_mul_f16_e32
+; GFX9-DAG: v_pk_mul_f16
+; GFX9-DAG: v_mul_f16_e32
 ; GFX9-NOT: v_max
 ; GFX9-NOT: v_pk_max
 define <2 x half> @v_test_canonicalize_insertelement_v2f16(<2 x half> %vec, half %val, i32 %idx) {
 %vec.op = fmul <2 x half> %vec, <half 4.0, half 4.0>
 %ins.op = fmul half %val, 8.0
 %ins = insertelement <2 x half> %vec.op, half %ins.op, i32 %idx
 %canonicalized = call <2 x half> @llvm.canonicalize.v2f16(<2 x half> %ins)
 ret <2 x half> %canonicalized

test/CodeGen/AMDGPU/fptrunc.f16.ll

@@ entry:
 %r.val = fptrunc float %a.fneg.fabs to half
 store half %r.val, half addrspace(1)* %r
 ret void
 }

 ; GCN-LABEL: {{^}}fptrunc_f32_to_f16_zext_i32:
 ; GCN: buffer_load_dword v[[A_F32:[0-9]+]]
 ; GCN: v_cvt_f16_f32_e32 v[[R_F16:[0-9]+]], v[[A_F32]]
-; GCN-NOT: v[[R_F16]]
+; SIVI-NOT: v[[R_F16]]
+; GFX9: v_and_b32_e32 v[[R_F16]], 0xffff, v[[R_F16]]
 ; GCN: buffer_store_dword v[[R_F16]]
 define amdgpu_kernel void @fptrunc_f32_to_f16_zext_i32(
 i32 addrspace(1)* %r,
 float addrspace(1)* %a) #0 {
 entry:
 %a.val = load float, float addrspace(1)* %a
 %r.val = fptrunc float %a.val to half
 %r.i16 = bitcast half %r.val to i16
 %zext = zext i16 %r.i16 to i32
 store i32 %zext, i32 addrspace(1)* %r
 ret void
 }

 ; GCN-LABEL: {{^}}fptrunc_fabs_f32_to_f16_zext_i32:
 ; GCN: buffer_load_dword v[[A_F32:[0-9]+]]
 ; GCN: v_cvt_f16_f32_e64 v[[R_F16:[0-9]+]], \|v[[A_F32]]\|
-; GCN-NOT: v[[R_F16]]
+; SIVI-NOT: v[[R_F16]]
+; GFX9: v_and_b32_e32 v[[R_F16]], 0xffff, v[[R_F16]]
 ; GCN: buffer_store_dword v[[R_F16]]
 define amdgpu_kernel void @fptrunc_fabs_f32_to_f16_zext_i32(
 i32 addrspace(1)* %r,
 float addrspace(1)* %a) #0 {
 entry:
 %a.val = load float, float addrspace(1)* %a
 %a.fabs = call float @llvm.fabs.f32(float %a.val)
 %r.val = fptrunc float %a.fabs to half

test/CodeGen/AMDGPU/mad-mix-lo.ll

@@ define <4 x half> @v_mad_mix_v4f32_clamp_precvt(<4 x half> %src0, <4 x half> %src1, <4 x half> %src2) #0 {
 %src2.ext = fpext <4 x half> %src2 to <4 x float>
 %result = tail call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %src0.ext, <4 x float> %src1.ext, <4 x float> %src2.ext)
 %max = call <4 x float> @llvm.maxnum.v4f32(<4 x float> %result, <4 x float> zeroinitializer)
 %clamp = call <4 x float> @llvm.minnum.v4f32(<4 x float> %max, <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>)
 %cvt.result = fptrunc <4 x float> %clamp to <4 x half>
 ret <4 x half> %cvt.result
 }

+; GCN-LABEL: mixlo_zext:
+; GCN: s_waitcnt
+; GFX9-NEXT: v_mad_mixlo_f16 v0, v0, v1, v2{{$}}
+; GFX9-NEXT: v_and_b32_e32 v0, 0xffff, v0
+; GFX9-NEXT: s_setpc_b64
+
+; CIVI: v_mac_f32_e32
+; CIVI: v_cvt_f16_f32_e32
+define i32 @mixlo_zext(float %src0, float %src1, float %src2) #0 {
+%result = call float @llvm.fmuladd.f32(float %src0, float %src1, float %src2)
+%cvt.result = fptrunc float %result to half
+%cvt.result.i16 = bitcast half %cvt.result to i16
+%cvt.result.i32 = zext i16 %cvt.result.i16 to i32
+ret i32 %cvt.result.i32
+}
+
 declare half @llvm.minnum.f16(half, half) #1
 declare <2 x half> @llvm.minnum.v2f16(<2 x half>, <2 x half>) #1
 declare <3 x half> @llvm.minnum.v3f16(<3 x half>, <3 x half>) #1
 declare <4 x half> @llvm.minnum.v4f16(<4 x half>, <4 x half>) #1

 declare half @llvm.maxnum.f16(half, half) #1
 declare <2 x half> @llvm.maxnum.v2f16(<2 x half>, <2 x half>) #1
 declare <3 x half> @llvm.maxnum.v3f16(<3 x half>, <3 x half>) #1