Details

Reviewers

arsenm
Joe_Nash
Petar.Avramovic

Commits

rG3ba8dabbf31b: [AMDGPU] Add sdot4 / sdot8 intrinsics for gfx11

Summary

This provides a uniform way to lower into the relevant instructions across all generations.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jrbyrnes created this revision. Aug 21 2023, 5:21 PM

Herald added a project: Restricted Project. Aug 21 2023, 5:21 PM

Herald added subscribers: foad, kerbowa, hiraditya and 5 others.

jrbyrnes requested review of this revision. Aug 21 2023, 5:21 PM

Herald added a project: Restricted Project. Aug 21 2023, 5:21 PM

Herald added subscribers: llvm-commits, wdng.

jrbyrnes added a child revision: D155995: [AMDGPU]: Allow combining into v_dot4. Aug 21 2023, 5:21 PM

Harbormaster completed remote builds in B253955: Diff 552172. Aug 21 2023, 6:14 PM

Title is confusing, this isn't adding new intrinsics

arsenm added inline comments.

llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–452	I don't understand how these cases are different. Is the intrinsic name just slightly different from the instruction name?

jrbyrnes retitled this revision from [AMDGPU] Add sdot4 / sdot8 intrinsics for gfx11 to [AMDGPU] Support sdot4 / sdot8 intrinsics on gfx11. Aug 22 2023, 11:09 AM

jrbyrnes added inline comments. Aug 22 2023, 11:23 AM

llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–452	On all other targets with 8-bit and 4-bit signed dot, we codegen for int_amdgcn_sdot4 and int_amdgcn_sdot8. However, we don't support these on gfx1100 -- instead, gfx1100 has int_amdgcn_sudot4 / int_amdgcn_sudot8. The result is that users of these intrinsics must always check the target to pick the corresponding one (sudot4 for gfx1100, and sdot4 for all others). This patch removes that responsibility from the user, so they can use sdot4 across all targets and still generate the corresponding instructions.
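To make the comment above concrete, here is a minimal Python sketch of what a call to sdot4 is expected to compute regardless of target. It is a reference model written for illustration, not code from the patch: the function names are hypothetical, and the clamp and wrap-around behaviour are assumptions based on the usual description of the intrinsic (four signed 8-bit products accumulated into a 32-bit value).

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def sdot4_model(a, b, c, clamp):
    """Hypothetical reference model of llvm.amdgcn.sdot4 on packed i32 inputs."""
    total = to_signed(c, 32)
    for lane in range(4):
        # Each lane is one signed byte of the packed 32-bit operands.
        total += to_signed(a >> (8 * lane), 8) * to_signed(b >> (8 * lane), 8)
    if clamp:
        # Assumption: clamp saturates to the signed 32-bit range.
        return max(-(1 << 31), min((1 << 31) - 1, total))
    # Assumption: without clamp the sum wraps modulo 2**32.
    return to_signed(total, 32)


# The IR-level call stays the same on every target; only the selected
# instruction differs (names taken from the CHECK lines in this revision).
SELECTED_INSTRUCTION = {"gfx906": "v_dot4_i32_i8", "gfx1100": "v_dot4_i32_iu8"}
```

With this model, a caller no longer needs a per-target branch: the same `sdot4_model` semantics are what both instructions in `SELECTED_INSTRUCTION` are expected to implement.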

arsenm added inline comments. Aug 22 2023, 12:05 PM

llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–452	Are there unit tests for these somewhere? I don't really know the full history of these instructions and I'm worried there was some random edge case behavior change

Properly handle neg modifier

jrbyrnes added inline comments.

llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–452	Apologies, this is my mistake, which may have caused confusion. The main difference between V_DOT4_I32_IU8 on gfx1100 and V_DOT4_I32_I8 on gfx90a (for example) is that V_DOT4_I32_IU8 can be either signed or unsigned, depending on the NEG bit in the operand modifier. This target-specific feature is probably why there is special handling. See llvm.amdgcn.sudot4 for unit tests.
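The role of the NEG bit can be sketched with a small Python model. This is an illustration under stated assumptions, not the backend's implementation: the bit values are assumed to mirror SISrcMods in LLVM's SIDefines.h (NEG = 1, OP_SEL_1 = 8), and `iu_src_mods`, `lane`, and `dot4_iu8_model` are hypothetical names. Under that assumption, a modifier immediate of 8 leaves NEG clear (unsigned operand) while 9 sets it (signed operand).

```python
# Assumed bit values, mirroring SISrcMods in LLVM's SIDefines.h.
NEG = 1 << 0        # on the IU dot instructions: treat the operand as signed
OP_SEL_1 = 1 << 3   # default op_sel_hi bit for VOP3P operands


def iu_src_mods(is_signed):
    """Hypothetical helper: build the source-modifier immediate for one operand."""
    return OP_SEL_1 | (NEG if is_signed else 0)


def lane(value, index, is_signed):
    """Extract byte `index` of a packed 32-bit value as signed or unsigned."""
    byte = (value >> (8 * index)) & 0xFF
    return byte - 256 if is_signed and byte >= 128 else byte


def dot4_iu8_model(a_mods, a, b_mods, b, c):
    """Non-clamping model of v_dot4_i32_iu8, with signedness taken from NEG."""
    total = c
    for i in range(4):
        total += lane(a, i, bool(a_mods & NEG)) * lane(b, i, bool(b_mods & NEG))
    total &= 0xFFFFFFFF  # wrap to 32 bits
    return total - (1 << 32) if total >= (1 << 31) else total
```

For example, `dot4_iu8_model(iu_src_mods(True), 0xFF, iu_src_mods(True), 0x02, 0)` treats 0xFF as -1, whereas passing `iu_src_mods(False)` for both operands treats it as 255. Matching sdot4 semantics therefore requires NEG to be set on both source operands.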

jrbyrnes added a reviewer: Joe_Nash. Aug 22 2023, 12:20 PM

arsenm added inline comments. Aug 22 2023, 12:24 PM

llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–452	I mean tests that actually execute, not lit tests

jrbyrnes mentioned this in D155995: [AMDGPU]: Allow combining into v_dot4. Aug 22 2023, 12:45 PM

Harbormaster completed remote builds in B254163: Diff 552466. Aug 22 2023, 1:09 PM

jrbyrnes added inline comments. Aug 23 2023, 9:45 AM

llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–452	So I've tracked down some unit tests: https://github.com/ROCm-Developer-Tools/HIP/blob/b8965f1f3d58d7adf7d702c09e75ebf3dd718f8c/tests/src/deviceLib/hipTestDotFunctions.cpp#L34 These calls are implemented as calls to __ockl_sdot4: https://github.com/ROCm-Developer-Tools/clr/blob/5914ac3c6e9b3848023a7fa25e19e560b1c38541/hipamd/include/hip/amd_detail/amd_math_functions.h#L148C60-L148C60 That is, in turn, implemented as calls to target-specific builtins: https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/46939af92ad91238c878a82aad2220822073ffa1/ockl/src/dots.cl#L124 For gfx1100, this lowers to the __builtin_amdgcn_sudot4 builtin. If you want, I can hack a compiler to lower __builtin_amdgcn_sudot4 into int_amdgcn_sdot4 and find a way to run these tests.

jrbyrnes added inline comments. Aug 23 2023, 9:48 AM

llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–452	Probably worth mentioning: I have been validating correctness using the CK 8-bit and 16-bit test suites, which -- due to https://reviews.llvm.org/D155995 -- have many existing tests that lower into int_amdgcn_sdot4 for gfx1100.

arsenm accepted this revision. Aug 23 2023, 2:43 PM

arsenm added inline comments.

llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–452	Ugh, this test is bad. It barely tests it compiles. Really these should test all the edge cases
437–452	So apparently we have overlapping intrinsics. We should probably canonicalize llvm.amdgcn.sudot4 cases representable with sdot/udot in AMDGPUInstCombineIntrinsic
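The canonicalization suggested above could follow a rule like the one sketched below. This is a hypothetical decision function, not code from AMDGPUInstCombineIntrinsic: the name `canonical_dot4` and the policy are assumptions made for illustration. In particular, the unsigned case is only folded when clamp is off, on the assumption that signed and unsigned saturation bounds differ.

```python
def canonical_dot4(a_signed, b_signed, clamp):
    """Hypothetical rule: pick the simplest intrinsic for a sudot4 call."""
    if a_signed and b_signed:
        # Both operands signed: the case this revision maps onto sdot4.
        return "llvm.amdgcn.sdot4"
    if not a_signed and not b_signed and not clamp:
        # Assumed safe only without clamp, because the wrapped 32-bit result
        # is bit-identical while saturation bounds would differ.
        return "llvm.amdgcn.udot4"
    # Mixed signedness (or an unsigned clamp) has no simpler equivalent here.
    return "llvm.amdgcn.sudot4"
```

A real combine would also need to check that the chosen intrinsic is legal on the current subtarget; that check is omitted from this sketch.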

This revision is now accepted and ready to land. Aug 23 2023, 2:43 PM

arsenm added inline comments. Aug 23 2023, 2:44 PM

llvm/lib/Target/AMDGPU/VOP3PInstructions.td
437–452	Not even that, this is barely a front-end test. The optimizer can delete almost all of this.

Can you also document these intrinsics in AMDGPUUsage?

Add note for intrinsics

Include comments about dot*c, remove unintended changes to test

arsenm accepted this revision. Aug 24 2023, 12:45 PM

arsenm added inline comments.

llvm/docs/AMDGPUUsage.rst
1062 ↗	(On Diff #553196)	Think this needs a new line separator

Harbormaster completed remote builds in B254674: Diff 553196. Aug 24 2023, 1:03 PM

Seems like a good convenience feature. LGTM

This revision was landed with ongoing or failed builds. Aug 25 2023, 11:46 AM

Closed by commit rG3ba8dabbf31b: [AMDGPU] Add sdot4 / sdot8 intrinsics for gfx11 (authored by jrbyrnes).

This revision was automatically updated to reflect the committed changes.

jrbyrnes added a commit: rG3ba8dabbf31b: [AMDGPU] Add sdot4 / sdot8 intrinsics for gfx11.

Diff 552172

llvm/lib/Target/AMDGPU/VOP3PInstructions.td

  def : GCNPat < (intrinsic_node (DotIUVOP3PMods i32:$src0_mods), i32:$src0,
                  i32:$src2, (i1 timm:$clamp)),
                 (!cast<Instruction>(NAME) $src0_mods, i32:$src0,
                  $src1_mods, i32:$src1,
                  (i32 8), i32:$src2, i1:$clamp)
  >;
  }

  let SubtargetPredicate = HasDot8Insts in {
  defm V_DOT4_I32_IU8 : VOP3PDOTIUInst<"v_dot4_i32_iu8", int_amdgcn_sudot4>;
  defm V_DOT8_I32_IU4 : VOP3PDOTIUInst<"v_dot8_i32_iu4", int_amdgcn_sudot8>;

+ def : GCNPat < (int_amdgcn_sdot8 i32:$src0,
+                 i32:$src1,
+                 i32:$src2, (i1 timm:$clamp)),
+                (V_DOT8_I32_IU4 (i32 8), i32:$src0,
+                 (i32 8), i32:$src1, (i32 8), i32:$src2, i1:$clamp)
+ >;
+
+ def : GCNPat < (int_amdgcn_sdot4 i32:$src0,
+                 i32:$src1,
+                 i32:$src2, (i1 timm:$clamp)),
+                (V_DOT4_I32_IU8 (i32 8), i32:$src0,
+                 (i32 8), i32:$src1, (i32 8), i32:$src2, i1:$clamp)
+ >;
  } // End SubtargetPredicate = HasDot8Insts

  def : UDot2Pat<V_DOT2_U32_U16>;
  def : SDot2Pat<V_DOT2_I32_I16>;

  foreach Type = ["U", "I"] in
  let SubtargetPredicate = !cast<VOP_Pseudo>("V_DOT4_"#Type#"32_"#Type#8).SubtargetPredicate in
  def : GCNPat <

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sdot4.ll

  ; RUN: llc -march=amdgcn -mcpu=gfx906 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GFX906
  ; RUN: llc -march=amdgcn -mcpu=gfx1011 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GFX10
  ; RUN: llc -march=amdgcn -mcpu=gfx1012 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GFX10
  ; RUN: llc -march=amdgcn -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GFX10
  ; RUN: llc -march=amdgcn -mcpu=gfx1031 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GFX10
+ ; RUN: llc -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GFX11

  declare i32 @llvm.amdgcn.sdot4(i32 %a, i32 %b, i32 %c, i1 %clamp)

  ; GCN-LABEL: {{^}}test_llvm_amdgcn_sdot4_clamp
  ; GFX906: v_dot4_i32_i8 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} clamp{{$}}
  ; GFX10: v_dot4_i32_i8 v{{[0-9]+}}, s{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}} clamp{{$}}
+ ; GFX11: v_dot4_i32_iu8 v{{[0-9]+}}, s{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}} clamp{{$}}
  define amdgpu_kernel void @test_llvm_amdgcn_sdot4_clamp(
      ptr addrspace(1) %r,
      ptr addrspace(1) %a,
      ptr addrspace(1) %b,
      ptr addrspace(1) %c) {
  entry:
    %a.val = load <4 x i8>, ptr addrspace(1) %a
    %b.val = load <4 x i8>, ptr addrspace(1) %b
    %a.val.cast = bitcast <4 x i8> %a.val to i32
    %b.val.cast = bitcast <4 x i8> %b.val to i32
    %c.val = load i32, ptr addrspace(1) %c
    %r.val = call i32 @llvm.amdgcn.sdot4(i32 %a.val.cast, i32 %b.val.cast, i32 %c.val, i1 1)
    store i32 %r.val, ptr addrspace(1) %r
    ret void
  }

  ; GCN-LABEL: {{^}}test_llvm_amdgcn_sdot4_no_clamp
  ; GFX906: v_dot4_i32_i8 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}{{$}}
  ; GFX10: v_dot4c_i32_i8_e32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}{{$}}
+ ; GFX11: v_dot4_i32_iu8 v{{[0-9]+}}, s{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}{{$}}
  define amdgpu_kernel void @test_llvm_amdgcn_sdot4_no_clamp(
      ptr addrspace(1) %r,
      ptr addrspace(1) %a,
      ptr addrspace(1) %b,
      ptr addrspace(1) %c) {
  entry:
    %a.val = load <4 x i8>, ptr addrspace(1) %a
    %b.val = load <4 x i8>, ptr addrspace(1) %b
    %a.val.cast = bitcast <4 x i8> %a.val to i32
    %b.val.cast = bitcast <4 x i8> %b.val to i32
    %c.val = load i32, ptr addrspace(1) %c
    %r.val = call i32 @llvm.amdgcn.sdot4(i32 %a.val.cast, i32 %b.val.cast, i32 %c.val, i1 0)
    store i32 %r.val, ptr addrspace(1) %r
    ret void
  }

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sdot8.ll

  ; RUN: llc -march=amdgcn -mcpu=gfx906 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GFX906
  ; RUN: llc -march=amdgcn -mcpu=gfx908 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GFX908
  ; RUN: llc -march=amdgcn -mcpu=gfx1011 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GFX10
  ; RUN: llc -march=amdgcn -mcpu=gfx1012 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GFX10
  ; RUN: llc -march=amdgcn -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GFX10
  ; RUN: llc -march=amdgcn -mcpu=gfx1031 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GCN,GFX10
+ ; RUN: llc -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck %s --check-prefixes=GFX11

  declare i32 @llvm.amdgcn.sdot8(i32 %a, i32 %b, i32 %c, i1 %clamp)

  ; GCN-LABEL: {{^}}test_llvm_amdgcn_sdot8_clamp
  ; GFX906: v_dot8_i32_i4 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} clamp{{$}}
  ; GFX908: v_dot8_i32_i4 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} clamp{{$}}
  ; GFX10: v_dot8_i32_i4 v{{[0-9]+}}, s{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}} clamp{{$}}
+ ; GFX11: v_dot8_i32_iu4 v{{[0-9]+}}, s{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}} clamp{{$}}
  define amdgpu_kernel void @test_llvm_amdgcn_sdot8_clamp(
      ptr addrspace(1) %r,
      ptr addrspace(1) %a,
      ptr addrspace(1) %b,
      ptr addrspace(1) %c) {
  entry:
    %a.val = load <8 x i4>, ptr addrspace(1) %a
    %b.val = load <8 x i4>, ptr addrspace(1) %b
    %a.val.cast = bitcast <8 x i4> %a.val to i32
    %b.val.cast = bitcast <8 x i4> %b.val to i32
    %c.val = load i32, ptr addrspace(1) %c
    %r.val = call i32 @llvm.amdgcn.sdot8(i32 %a.val.cast, i32 %b.val.cast, i32 %c.val, i1 1)
    store i32 %r.val, ptr addrspace(1) %r
    ret void
  }

  ; GCN-LABEL: {{^}}test_llvm_amdgcn_sdot8_no_clamp
  ; GFX906: v_dot8_i32_i4 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}{{$}}
  ; GFX908: v_dot8c_i32_i4_e32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}{{$}}
  ; GFX10: v_dot8_i32_i4 v{{[0-9]+}}, s{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}{{$}}
+ ; GFX11: v_dot8_i32_iu4 v{{[0-9]+}}, s{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}{{$}}
  define amdgpu_kernel void @test_llvm_amdgcn_sdot8_no_clamp(
      ptr addrspace(1) %r,
      ptr addrspace(1) %a,
      ptr addrspace(1) %b,
      ptr addrspace(1) %c) {
  entry:
    %a.val = load <8 x i4>, ptr addrspace(1) %a
    %b.val = load <8 x i4>, ptr addrspace(1) %b
    %a.val.cast = bitcast <8 x i4> %a.val to i32
    %b.val.cast = bitcast <8 x i4> %b.val to i32
    %c.val = load i32, ptr addrspace(1) %c
    %r.val = call i32 @llvm.amdgcn.sdot8(i32 %a.val.cast, i32 %b.val.cast, i32 %c.val, i1 0)
    store i32 %r.val, ptr addrspace(1) %r
    ret void
  }

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Support sdot4 / sdot8 intrinsics on gfx11
Closed · Public
