This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Match initial thread pattern on AMDGPU
Abandoned · Public

Authored by jhuber6 on Jun 25 2021, 6:35 AM.

Details

Summary

The AAExecutionDomain check used to push globalized memory calls to
shared memory doesn't match the pattern AMDGPU generates, so the
optimization only works on NVPTX targets. This patch adds AMDGPU's
pattern to the check.

Depends on D102423

Diff Detail

Event Timeline

jhuber6 created this revision. · Jun 25 2021, 6:35 AM
jhuber6 requested review of this revision. · Jun 25 2021, 6:35 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jun 25 2021, 6:35 AM

Ah, nice catch. I have not been paying enough attention to OpenMPOpt, this pattern will indeed miss on amdgcn.

__kmpc_amdgcn_gpu_num_threads is a library function because there is no corresponding intrinsic. I think each of the nvptx intrinsics has either a corresponding amdgcn intrinsic or a corresponding function call, but there might also be some things that are a scalar constant on one arch and a function returning a constant on the other.

For this patch, I'm wondering if we can use a single pattern, preceded by:
auto &&m_BlockSize = nvidia ? m_Intrinsic<Intrinsic::nvvm_read_ptx_sreg_ntid_x>() : m_Intrinsic<Intrinsic::some-amd-name>();

In the general case, I'd like the Opt layer to be more architecture agnostic than this. Could we insert functions at codegen like 'amdgpu_get_block_size', pattern match those in the IR opt, and lower them to the nvptx or amdgcn intrinsics later on?

> Ah, nice catch. I have not been paying enough attention to OpenMPOpt, this pattern will indeed miss on amdgcn.

> __kmpc_amdgcn_gpu_num_threads is a library function because there is no corresponding intrinsic. I think each of the nvptx intrinsics has either a corresponding amdgcn intrinsic or a corresponding function call, but there might also be some things that are a scalar constant on one arch and a function returning a constant on the other.

I was debating whether to add this as an intrinsic, or at least as an RTL function in OMPKinds.def, and just settled on this ugly string comparison.

> For this patch, I'm wondering if we can use a single pattern, preceded by:
> auto &&m_BlockSize = nvidia ? m_Intrinsic<Intrinsic::nvvm_read_ptx_sreg_ntid_x>() : m_Intrinsic<Intrinsic::some-amd-name>();

The patterns are slightly different beyond how the block size is found: AMD uses a constant bit mask, while Nvidia derives it from the warp size.

> In the general case, I'd like the Opt layer to be more architecture agnostic than this. Could we insert functions at codegen like 'amdgpu_get_block_size', pattern match those in the IR opt, and lower them to the nvptx or amdgcn intrinsics later on?

This is planned for when we switch over to the new device runtime library, where it will be a simple comparison of a TID function against zero. Right now it needs to make some convoluted calls to determine whether a thread is inside the "master warp" for the runtime library.

As part of D101976 I will change the matching to be target independent. This is not needed.

ormris removed a subscriber: ormris. · Jun 29 2021, 10:20 AM
jhuber6 abandoned this revision. · Jul 10 2021, 7:08 PM