This is an archive of the discontinued LLVM Phabricator instance.


Differential D158382

[OpenMP] Use default grid value for static grid size
Closed, Public

Authored by jdoerfert on Aug 20 2023, 8:06 PM.


Details

Reviewers

ye-luo
jhuber6
tianshilei1992

Commits

rG7481b465ae30: [OpenMP] Use default grid value for static grid size

Summary

If the user did not provide any static clause to override the grid size,
we assume the default grid size as upper bound and use it to improve
code generation through vendor specific attributes.

Fixes: https://github.com/llvm/llvm-project/issues/64816
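
As a rough illustration of the summary, the following standalone sketch models the attribute selection without depending on LLVM. The names `threadLimitAttrs`, `AttrMap`, and `kAssumedDefaultWGSize` are hypothetical; the actual patch sets attributes on an `llvm::Function` and reads the default from `omp::GV`. The value 256 is taken from the test expectations in this revision, and treating -1 as "no static clause given" is an assumption based on the `NumThreads == -1` check in the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for the string attributes attached to a kernel.
using AttrMap = std::map<std::string, std::string>;

// Assumed default work-group size; the patch reads GV_Default_WG_Size instead.
constexpr int32_t kAssumedDefaultWGSize = 256;

// Returns the thread-limit attributes for a kernel.
// NumThreads == -1 is assumed to mean "no static clause was given".
AttrMap threadLimitAttrs(int32_t NumThreads, bool IsAMDGPUKernel) {
  assert(NumThreads >= -1 && "unexpected thread count");
  AttrMap Attrs;
  if (NumThreads == -1)
    NumThreads = kAssumedDefaultWGSize; // fall back to the default grid value
  if (NumThreads > 0) {
    // The AMDGPU attribute is a "min,max" range; only the upper bound is set
    // from the thread limit.
    if (IsAMDGPUKernel)
      Attrs["amdgpu-flat-work-group-size"] = "1," + std::to_string(NumThreads);
    Attrs["omp_target_thread_limit"] = std::to_string(NumThreads);
  }
  return Attrs;
}
```

Under these assumptions, `threadLimitAttrs(-1, true)` yields the same `"1,256"` range and `"256"` thread limit that the updated `amdgcn-attributes.cpp` checks expect.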

Event Timeline

jdoerfert created this revision. Aug 20 2023, 8:06 PM

Herald added a project: Restricted Project. Aug 20 2023, 8:06 PM

Herald added subscribers: guansong, bollu, hiraditya and 2 others.

jdoerfert requested review of this revision. Aug 20 2023, 8:06 PM

Herald added a project: Restricted Project. Aug 20 2023, 8:06 PM

Herald added subscribers: llvm-commits, jplehr, sstefan1.

jdoerfert added a parent revision: D158381: [OpenMP] Properly set static thread limit (w/o analysis). Aug 20 2023, 8:06 PM

Harbormaster completed remote builds in B253754: Diff 551884. Aug 20 2023, 8:07 PM

jdoerfert mentioned this in D158383: [OpenMP] Add NVIDIA annotations for static grid thread limit. Aug 21 2023, 8:47 AM

jhuber6 added inline comments. Aug 21 2023, 8:50 AM

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
Line 4133: Can't we just get the module from the Function and check the triple?

Is amdgpu-flat-work-group-size a bound or exact value?

In D158382#4603950, @tianshilei1992 wrote:

Is amdgpu-flat-work-group-size a bound or exact value?

It's a range. What really matters is the upper bound, that is the value you see in the test (max flat ...).

Added checks.

LG, thanks

This revision is now accepted and ready to land. Aug 21 2023, 9:24 AM

Harbormaster completed remote builds in B253872: Diff 552052. Aug 21 2023, 9:24 AM

This revision was landed with ongoing or failed builds. Aug 23 2023, 11:13 AM

Closed by commit rG7481b465ae30: [OpenMP] Use default grid value for static grid size (authored by jdoerfert).

This revision was automatically updated to reflect the committed changes.

jdoerfert added a commit: rG7481b465ae30: [OpenMP] Use default grid value for static grid size.

Herald added a project: Restricted Project. Aug 23 2023, 11:13 AM

Herald added a subscriber: openmp-commits.

This patch produces the following difference in IR out of CodeGen.

Without this patch:

%nvptx_num_threads.i = tail call i32 @__kmpc_get_hardware_num_threads_in_block() #2
call void @__kmpc_distribute_static_init_4(ptr addrspacecast (ptr addrspace(1) @2 to ptr), i32 %1, i32 91, ptr nonnull %.omp.is_last.ascast.i, ptr nonnull %.omp.comb.lb.ascast.i, ptr nonnull %.omp.comb.ub.ascast.i, ptr nonnull %.omp.stride.ascast.i, i32 1, i32 %nvptx_num_threads.i) #2

With this patch:

call void @__kmpc_distribute_static_init_4(ptr addrspacecast (ptr addrspace(1) @2 to ptr), i32 %1, i32 91, ptr nonnull %.omp.is_last.ascast.i, ptr nonnull %.omp.comb.lb.ascast.i, ptr nonnull %.omp.comb.ub.ascast.i, ptr nonnull %.omp.stride.ascast.i, i32 1, i32 256) #2

Setting the blocksize to a constant too early would be a problem if the runtime changes the blocksize, e.g. because of an environment variable or because of a low trip count (D152014). Comments? @jdoerfert

In D158382#4621885, @dhruvachak wrote:
This patch produces the following difference in IR out of CodeGen.

Without this patch:
%nvptx_num_threads.i = tail call i32 @__kmpc_get_hardware_num_threads_in_block() #2
call void @__kmpc_distribute_static_init_4(ptr addrspacecast (ptr addrspace(1) @2 to ptr), i32 %1, i32 91, ptr nonnull %.omp.is_last.ascast.i, ptr nonnull %.omp.comb.lb.ascast.i, ptr nonnull %.omp.comb.ub.ascast.i, ptr nonnull %.omp.stride.ascast.i, i32 1, i32 %nvptx_num_threads.i) #2
With this patch:
call void @__kmpc_distribute_static_init_4(ptr addrspacecast (ptr addrspace(1) @2 to ptr), i32 %1, i32 91, ptr nonnull %.omp.is_last.ascast.i, ptr nonnull %.omp.comb.lb.ascast.i, ptr nonnull %.omp.comb.ub.ascast.i, ptr nonnull %.omp.stride.ascast.i, i32 1, i32 256) #2
Setting the blocksize to a constant too early would be a problem if the runtime changes the blocksize, e.g. because of an environment variable or because of a low trip count (D152014). Comments? @jdoerfert

From OpenMP-Opt:

case OMPRTL___kmpc_get_hardware_num_threads_in_block:
   Changed = Changed | foldKernelFnAttribute(A, "omp_target_thread_limit");
   break;

This is wrong. We should fold the thread limit, not num_threads_in_block.
The latter can only be folded if we will not lower it, which we currently cannot guarantee.
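
To make the concern concrete, here is a deliberately simplified toy model, not the semantics of `__kmpc_distribute_static_init_4`. It assumes each thread runs a strided loop whose stride was folded to a constant at compile time, and counts how many iterations are actually executed when the runtime launches a different number of threads. The function name `coveredIterations` and the strided-loop scheme are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <vector>

// Toy model: LaunchedThreads threads each execute iterations
// Tid, Tid + AssumedThreads, Tid + 2 * AssumedThreads, ...
// AssumedThreads is the thread count baked in at compile time.
// Returns how many of the N iterations were executed at least once.
int coveredIterations(int N, int LaunchedThreads, int AssumedThreads) {
  assert(N >= 0 && AssumedThreads > 0);
  std::vector<bool> Seen(N, false);
  for (int Tid = 0; Tid < LaunchedThreads; ++Tid)
    for (int I = Tid; I < N; I += AssumedThreads)
      Seen[I] = true;
  int Count = 0;
  for (bool B : Seen)
    Count += B;
  return Count;
}
```

In this model, a folded constant of 256 is harmless when 256 threads are launched, but if the runtime lowers the block size to 128, half of 1024 iterations are never executed. A thread limit, by contrast, is only an upper bound, so folding it stays valid when fewer threads run.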

Revision Contents

Path	Size
clang/test/OpenMP/amdgcn-attributes.cpp	8 lines
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp	19 lines
openmp/libomptarget/test/offloading/default_thread_limit.c	103 lines

Diff 551884

clang/test/OpenMP/amdgcn-attributes.cpp

[26 lines not shown]
#pragma omp target		#pragma omp target
return arr[0];		return arr[0];
}		}

int callable(int x) {		int callable(int x) {
// ALL-LABEL: @_Z8callablei(i32 noundef %x) #1		// ALL-LABEL: @_Z8callablei(i32 noundef %x) #1
return x + 1;		return x + 1;
}		}

// DEFAULT: attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "kernel" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "uniform-work-group-size"="true" }		// DEFAULT: attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "amdgpu-flat-work-group-size"="1,256" "kernel" "no-trapping-math"="true" "omp_target_thread_limit"="256" "stack-protector-buffer-size"="8" "uniform-work-group-size"="true" }
// CPU: attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "kernel" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx900" "target-features"="+16-bit-insts,+ci-insts,+dpp,+gfx8-insts,+gfx9-insts,+s-memrealtime,+s-memtime-inst,+wavefrontsize64" "uniform-work-group-size"="true" }		// CPU: attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "amdgpu-flat-work-group-size"="1,256" "kernel" "no-trapping-math"="true" "omp_target_thread_limit"="256" "stack-protector-buffer-size"="8" "target-cpu"="gfx900" "target-features"="+16-bit-insts,+ci-insts,+dpp,+gfx8-insts,+gfx9-insts,+s-memrealtime,+s-memtime-inst,+wavefrontsize64" "uniform-work-group-size"="true" }
// NOIEEE: attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "amdgpu-ieee"="false" "kernel" "no-nans-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "uniform-work-group-size"="true" }		// NOIEEE: attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "amdgpu-flat-work-group-size"="1,256" "amdgpu-ieee"="false" "kernel" "no-nans-fp-math"="true" "no-trapping-math"="true" "omp_target_thread_limit"="256" "stack-protector-buffer-size"="8" "uniform-work-group-size"="true" }
// UNSAFEATOMIC: attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "amdgpu-unsafe-fp-atomics"="true" "kernel" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "uniform-work-group-size"="true" }		// UNSAFEATOMIC: attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "amdgpu-flat-work-group-size"="1,256" "amdgpu-unsafe-fp-atomics"="true" "kernel" "no-trapping-math"="true" "omp_target_thread_limit"="256" "stack-protector-buffer-size"="8" "uniform-work-group-size"="true" }

// DEFAULT: attributes #1 = { convergent mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" }		// DEFAULT: attributes #1 = { convergent mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
// CPU: attributes #1 = { convergent mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx900" "target-features"="+16-bit-insts,+ci-insts,+dpp,+gfx8-insts,+gfx9-insts,+s-memrealtime,+s-memtime-inst,+wavefrontsize64" }		// CPU: attributes #1 = { convergent mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx900" "target-features"="+16-bit-insts,+ci-insts,+dpp,+gfx8-insts,+gfx9-insts,+s-memrealtime,+s-memtime-inst,+wavefrontsize64" }
// NOIEEE: attributes #1 = { convergent mustprogress noinline nounwind optnone "amdgpu-ieee"="false" "no-nans-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" }		// NOIEEE: attributes #1 = { convergent mustprogress noinline nounwind optnone "amdgpu-ieee"="false" "no-nans-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
// UNSAFEATOMIC: attributes #1 = { convergent mustprogress noinline nounwind optnone "amdgpu-unsafe-fp-atomics"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" }		// UNSAFEATOMIC: attributes #1 = { convergent mustprogress noinline nounwind optnone "amdgpu-unsafe-fp-atomics"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" }

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp

[17 lines not shown]
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
#include "llvm/Analysis/AssumptionCache.h"		#include "llvm/Analysis/AssumptionCache.h"
#include "llvm/Analysis/CodeMetrics.h"		#include "llvm/Analysis/CodeMetrics.h"
#include "llvm/Analysis/LoopInfo.h"		#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/OptimizationRemarkEmitter.h"		#include "llvm/Analysis/OptimizationRemarkEmitter.h"
#include "llvm/Analysis/ScalarEvolution.h"		#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/TargetLibraryInfo.h"		#include "llvm/Analysis/TargetLibraryInfo.h"
#include "llvm/Bitcode/BitcodeReader.h"		#include "llvm/Bitcode/BitcodeReader.h"
		#include "llvm/Frontend/OpenMP/OMPGridValues.h"
#include "llvm/IR/Attributes.h"		#include "llvm/IR/Attributes.h"
#include "llvm/IR/CFG.h"		#include "llvm/IR/CFG.h"
#include "llvm/IR/CallingConv.h"		#include "llvm/IR/CallingConv.h"
#include "llvm/IR/Constant.h"		#include "llvm/IR/Constant.h"
#include "llvm/IR/Constants.h"		#include "llvm/IR/Constants.h"
#include "llvm/IR/DebugInfoMetadata.h"		#include "llvm/IR/DebugInfoMetadata.h"
#include "llvm/IR/DerivedTypes.h"		#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/GlobalVariable.h"		#include "llvm/IR/GlobalVariable.h"
[4,082 lines not shown]
if (!updateToLocation(Loc))		if (!updateToLocation(Loc))
return;		return;

Function *Fn = getOrCreateRuntimeFunctionPtr(		Function *Fn = getOrCreateRuntimeFunctionPtr(
omp::RuntimeFunction::OMPRTL___kmpc_target_deinit);		omp::RuntimeFunction::OMPRTL___kmpc_target_deinit);

Builder.CreateCall(Fn, {});		Builder.CreateCall(Fn, {});
}		}

		static const omp::GV &getGridValue(Function *Kernel) {
		  if (Kernel->getCallingConv() == CallingConv::AMDGPU_KERNEL) {
		    StringRef Features =
		        Kernel->getFnAttribute("target-features").getValueAsString();
		    if (Features.count("+wavefrontsize64"))
		      return omp::getAMDGPUGridValues<64>();
		    return omp::getAMDGPUGridValues<32>();
		  }
		  // Assume NVPTX for now.
		  return omp::NVPTXGridValues;
		}

void OpenMPIRBuilder::setOutlinedTargetRegionFunctionAttributes(		void OpenMPIRBuilder::setOutlinedTargetRegionFunctionAttributes(
Function *OutlinedFn, int32_t NumTeams, int32_t NumThreads) {		Function *OutlinedFn, int32_t NumTeams, int32_t NumThreads) {
if (Config.isTargetDevice()) {		if (Config.isTargetDevice()) {
OutlinedFn->setLinkage(GlobalValue::WeakODRLinkage);		OutlinedFn->setLinkage(GlobalValue::WeakODRLinkage);
// TODO: Determine if DSO local can be set to true.		// TODO: Determine if DSO local can be set to true.
OutlinedFn->setDSOLocal(false);		OutlinedFn->setDSOLocal(false);
OutlinedFn->setVisibility(GlobalValue::ProtectedVisibility);		OutlinedFn->setVisibility(GlobalValue::ProtectedVisibility);
if (Triple(M.getTargetTriple()).isAMDGCN())		if (Triple(M.getTargetTriple()).isAMDGCN())
OutlinedFn->setCallingConv(CallingConv::AMDGPU_KERNEL);		OutlinedFn->setCallingConv(CallingConv::AMDGPU_KERNEL);
}		}

if (NumTeams > 0)		if (NumTeams > 0)
OutlinedFn->addFnAttr("omp_target_num_teams", std::to_string(NumTeams));		OutlinedFn->addFnAttr("omp_target_num_teams", std::to_string(NumTeams));

		if (NumThreads == -1)
		NumThreads = getGridValue(OutlinedFn).GV_Default_WG_Size;

if (NumThreads > 0) {		if (NumThreads > 0) {
if (OutlinedFn->getCallingConv() == CallingConv::AMDGPU_KERNEL) {		if (OutlinedFn->getCallingConv() == CallingConv::AMDGPU_KERNEL) {
OutlinedFn->addFnAttr("amdgpu-flat-work-group-size",		OutlinedFn->addFnAttr("amdgpu-flat-work-group-size",
llvm::utostr(NumThreads) + "," +		llvm::utostr(1) + "," + llvm::utostr(NumThreads));
llvm::utostr(NumThreads));
} else {		} else {
// TODO: Modify or create "maxntidx" module metadata.		// TODO: Modify or create "maxntidx" module metadata.
}		}
OutlinedFn->addFnAttr("omp_target_thread_limit",		OutlinedFn->addFnAttr("omp_target_thread_limit",
std::to_string(NumThreads));		std::to_string(NumThreads));
}		}
}		}

[2,196 lines not shown]

openmp/libomptarget/test/offloading/default_thread_limit.c

This file was added.

				// clang-format off
				// RUN: %libomptarget-compile-generic
				// RUN: env LIBOMPTARGET_INFO=16 \
				// RUN: %libomptarget-run-generic 2>&1 | %fcheck-generic --check-prefix=DEFAULT

				// UNSUPPORTED: nvptx64-nvidia-cuda
				// UNSUPPORTED: nvptx64-nvidia-cuda-LTO
				// UNSUPPORTED: aarch64-unknown-linux-gnu
				// UNSUPPORTED: aarch64-unknown-linux-gnu-LTO
				// UNSUPPORTED: x86_64-pc-linux-gnu
				// UNSUPPORTED: x86_64-pc-linux-gnu-LTO

				__attribute__((optnone)) int optnone() { return 1; }

				int main() {
				int N = optnone() * 4098 * 32;

				// DEFAULT: [[NT:(128|256)]] (MaxFlatWorkGroupSize: [[NT]]
				#pragma omp target teams distribute parallel for simd
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				// DEFAULT: [[NT:(128|256)]] (MaxFlatWorkGroupSize: [[NT]]
				#pragma omp target teams distribute parallel for simd
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				// DEFAULT: [[NT:(128|256)]] (MaxFlatWorkGroupSize: [[NT]]
				#pragma omp target teams distribute parallel for simd
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				// DEFAULT: [[NT:(128|256)]] (MaxFlatWorkGroupSize: [[NT]]
				#pragma omp target
				#pragma omp teams distribute parallel for
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				// DEFAULT: 42 (MaxFlatWorkGroupSize: 1024
				#pragma omp target thread_limit(optnone() * 42)
				#pragma omp teams distribute parallel for
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				// DEFAULT: 42 (MaxFlatWorkGroupSize: 42
				#pragma omp target thread_limit(optnone() * 42) ompx_attribute(__attribute__((amdgpu_flat_work_group_size(42, 42))))
				#pragma omp teams distribute parallel for
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				// FIXME: Use the attribute value to imply a thread_limit
				// DEFAULT: {{(128|256)}} (MaxFlatWorkGroupSize: 42
				#pragma omp target ompx_attribute(__attribute__((amdgpu_flat_work_group_size(42, 42))))
				#pragma omp teams distribute parallel for
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				// DEFAULT: MaxFlatWorkGroupSize: 1024
				#pragma omp target
				#pragma omp teams distribute parallel for num_threads(optnone() * 42)
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				// DEFAULT: MaxFlatWorkGroupSize: 1024
				#pragma omp target teams distribute parallel for thread_limit(optnone() * 42)
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				// DEFAULT: MaxFlatWorkGroupSize: 1024
				#pragma omp target teams distribute parallel for num_threads(optnone() * 42)
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				// DEFAULT: 9 (MaxFlatWorkGroupSize: 9
				#pragma omp target
				#pragma omp teams distribute parallel for num_threads(9)
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				// DEFAULT: 4 (MaxFlatWorkGroupSize: 4
				#pragma omp target thread_limit(4)
				#pragma omp teams distribute parallel for
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				// DEFAULT: 4 (MaxFlatWorkGroupSize: 4
				#pragma omp target
				#pragma omp teams distribute parallel for thread_limit(4)
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				// DEFAULT: 9 (MaxFlatWorkGroupSize: 9
				#pragma omp target teams distribute parallel for num_threads(9)
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				// DEFAULT: 4 (MaxFlatWorkGroupSize: 4
				#pragma omp target teams distribute parallel for simd thread_limit(4)
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				}