This is an archive of the discontinued LLVM Phabricator instance.

Differential D152014

[OpenMP] Improve default block count selection fow low block counts
Closed · Public

Authored by jdoerfert on Jun 2 2023, 10:48 AM.

Details

Reviewers

jhuber6
jplehr
tianshilei1992
fel-cab

Commits

rG6629a96a8ce5: [OpenMP] Improve default block count selection fow low block counts

Summary

If a combined loop has insufficient parallelism (i.e., a low trip count), we
might end up with too few teams/blocks. To counter that, we can reduce
the number of threads per team. This patch implements a heuristic
and exposes a new environment variable to control the minimum number of
threads employed in this case.
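
To make the problem concrete, here is a minimal sketch (not code from the patch) of how a trip count maps to a block count when each thread executes one loop iteration. The helper names `blocksForTripCount` and `exampleLowTripCount`, and the thread counts used, are assumptions for illustration; only the rounding formula is taken from the diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of blocks needed so that every thread runs one
// iteration, i.e. ceil(tripCount / threadsPerBlock).
// Assumes tripCount > 0 and threadsPerBlock > 0.
constexpr uint64_t blocksForTripCount(uint64_t tripCount,
                                      uint32_t threadsPerBlock) {
  return ((tripCount - 1) / threadsPerBlock) + 1;
}

// A 128-iteration loop fits into a single block of 128 threads, but spreads
// over 4 blocks of 32 threads or 16 blocks of 8 threads.
inline void exampleLowTripCount() {
  assert(blocksForTripCount(128, 128) == 1);
  assert(blocksForTripCount(128, 32) == 4);
  assert(blocksForTripCount(128, 8) == 16);
}
```

Fewer threads per block means more blocks for the same loop, which is the trade-off the heuristic exploits.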

Issue reported by:
Felipe Cabarcas Jaramillo <cabarcas@udel.edu> (@fel-cab).

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jdoerfert created this revision. Jun 2 2023, 10:48 AM

Herald added a project: Restricted Project. Jun 2 2023, 10:48 AM

Herald added subscribers: sunshaoce, guansong, bollu, yaxunl.

jdoerfert requested review of this revision. Jun 2 2023, 10:48 AM

Herald added a subscriber: sstefan1. Jun 2 2023, 10:48 AM

Harbormaster completed remote builds in B236219: Diff 527909. Jun 2 2023, 10:51 AM

jhuber6 added inline comments. Jun 2 2023, 10:56 AM

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
800	Shouldn't this correspond to the warp / wavefront size? On NVPTX it's 32 but on AMDGPU it could be 32 or 64. You can check using HSA.

jdoerfert added inline comments. Jun 2 2023, 10:58 AM

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
800	Not necessarily. AMD doesn't even have one 64 wide wave anyway, IIRC. We are running some tests on AMD hardware right now, will adjust if 64 comes back better.

This also breaks thread_limit, right?

#pragma omp target teams thread_limit(16)
#pragma omp parallel
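
The concern can be illustrated with a small sketch: a heuristic that adjusts the thread count has to clamp against the user's `thread_limit`, so it may lower the count but never raise it. The helper name `clampToThreadLimit` is hypothetical, and treating 0 as "no clause given" is an assumption that mirrors the `ThreadLimitClause[0] > 0` check in `getNumThreads`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: apply a heuristic thread count without exceeding the
// user's thread_limit. A threadLimit of 0 means no clause was given.
inline uint32_t clampToThreadLimit(uint32_t heuristicThreads,
                                   uint32_t threadLimit) {
  if (threadLimit == 0)
    return heuristicThreads;
  return std::min(heuristicThreads, threadLimit);
}
```

With `thread_limit(16)`, a heuristic value of 32 is clamped to 16, while a heuristic value of 8 is kept as is.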

jdoerfert added inline comments. Jun 2 2023, 2:46 PM

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp
339	@tianshilei1992 Yes, I missed a std::min here, will fix that in the final version.
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h
800	Results are in for Frontier. 8,16,32 are all "the same" for the code, 64 is worse. 32 is the winner (so far).

Ensure thread_limit is honored.

Harbormaster completed remote builds in B236290: Diff 528003. Jun 2 2023, 3:01 PM

I have tested it on Frontier with SPECacc 552.pep with different values of LIBOMPTARGET_MIN_THREADS_FOR_LOW_TRIP_COUNT:

LIBOMPTARGET_MIN_THREADS_FOR_LOW_TRIP_COUNT    Execution time (secs)
4                                              7
8                                              5
16                                             5
32                                             5
64                                             10
128                                            17
256                                            30
(without the patch)                            30

Force a power of two for the "middle" case, ensure thread_limit is honored.

Harbormaster completed remote builds in B236301: Diff 528019. Jun 2 2023, 3:28 PM

This revision is now accepted and ready to land. Jun 2 2023, 4:07 PM

Closed by commit rG6629a96a8ce5: [OpenMP] Improve default block count selection fow low block counts (authored by jdoerfert). Jun 5 2023, 4:36 PM

This revision was automatically updated to reflect the committed changes.

jdoerfert added a commit: rG6629a96a8ce5: [OpenMP] Improve default block count selection fow low block counts.

Herald added a project: Restricted Project. Jun 5 2023, 4:36 PM

Herald added a subscriber: openmp-commits.

tianshilei1992 mentioned this in D158802: [OpenMP] Honor `thread_limit` value when choosing grid size. Aug 24 2023, 6:14 PM

tianshilei1992 mentioned this in rGfbcce3370644: [OpenMP] Honor `thread_limit` value when choosing grid size. Aug 26 2023, 7:18 PM

dhruvachak mentioned this in D158382: [OpenMP] Use default grid value for static grid size. Aug 28 2023, 10:17 AM

GitHub <noreply@github.com> mentioned this in rG0d5b7dd25cc4: [OpenMP] Add a test for D158802 (#70678). Oct 30 2023, 12:59 PM

Revision Contents

Path                                                                              Size
openmp/docs/design/Runtimes.rst                                                   15 lines
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h      18 lines
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp    48 lines
openmp/libomptarget/test/offloading/small_trip_count.c                            41 lines

Diff 528628

openmp/docs/design/Runtimes.rst

[... 714 lines not shown ...] (context: variables is defined below.)
* ``LIBOMPTARGET_SHARED_MEMORY_SIZE=<Num>``		* ``LIBOMPTARGET_SHARED_MEMORY_SIZE=<Num>``
* ``LIBOMPTARGET_MAP_FORCE_ATOMIC=[TRUE/FALSE] (default TRUE)``		* ``LIBOMPTARGET_MAP_FORCE_ATOMIC=[TRUE/FALSE] (default TRUE)``
* ``LIBOMPTARGET_JIT_OPT_LEVEL={0,1,2,3} (default 3)``		* ``LIBOMPTARGET_JIT_OPT_LEVEL={0,1,2,3} (default 3)``
* ``LIBOMPTARGET_JIT_SKIP_OPT=[TRUE/FALSE] (default FALSE)``		* ``LIBOMPTARGET_JIT_SKIP_OPT=[TRUE/FALSE] (default FALSE)``
* ``LIBOMPTARGET_JIT_REPLACEMENT_OBJECT=<in:Filename> (object file)``		* ``LIBOMPTARGET_JIT_REPLACEMENT_OBJECT=<in:Filename> (object file)``
* ``LIBOMPTARGET_JIT_REPLACEMENT_MODULE=<in:Filename> (LLVM-IR file)``		* ``LIBOMPTARGET_JIT_REPLACEMENT_MODULE=<in:Filename> (LLVM-IR file)``
* ``LIBOMPTARGET_JIT_PRE_OPT_IR_MODULE=<out:Filename> (LLVM-IR file)``		* ``LIBOMPTARGET_JIT_PRE_OPT_IR_MODULE=<out:Filename> (LLVM-IR file)``
* ``LIBOMPTARGET_JIT_POST_OPT_IR_MODULE=<out:Filename> (LLVM-IR file)``		* ``LIBOMPTARGET_JIT_POST_OPT_IR_MODULE=<out:Filename> (LLVM-IR file)``
		* ``LIBOMPTARGET_MIN_THREADS_FOR_LOW_TRIP_COUNT=<Num> (default: 32)``

LIBOMPTARGET_DEBUG		LIBOMPTARGET_DEBUG
""""""""""""""""""		""""""""""""""""""

``LIBOMPTARGET_DEBUG`` controls whether or not debugging information will be		``LIBOMPTARGET_DEBUG`` controls whether or not debugging information will be
displayed. This feature is only available if ``libomptarget`` was built with		displayed. This feature is only available if ``libomptarget`` was built with
``-DOMPTARGET_DEBUG``. The debugging output provided is intended for use by		``-DOMPTARGET_DEBUG``. The debugging output provided is intended for use by
``libomptarget`` developers. More user-friendly output is presented when using		``libomptarget`` developers. More user-friendly output is presented when using
[... 372 lines not shown ...]
before the device JIT runs additional IR optimizations on it (see		before the device JIT runs additional IR optimizations on it (see
:ref:`LIBOMPTARGET_JIT_OPT_LEVEL`). The value is expected to be a filename into		:ref:`LIBOMPTARGET_JIT_OPT_LEVEL`). The value is expected to be a filename into
which the LLVM-IR module is written. The module can then be analyzed,		which the LLVM-IR module is written. The module can then be analyzed,
transformed, and loaded back into the JIT pipeline via		transformed, and loaded back into the JIT pipeline via
:ref:`LIBOMPTARGET_JIT_REPLACEMENT_MODULE`.		:ref:`LIBOMPTARGET_JIT_REPLACEMENT_MODULE`.


LIBOMPTARGET_JIT_POST_OPT_IR_MODULE		LIBOMPTARGET_JIT_POST_OPT_IR_MODULE
""""""""""""""""""""""""""""""""""		"""""""""""""""""""""""""""""""""""

This environment variable can be used to extract the embedded device code after		This environment variable can be used to extract the embedded device code after
the device JIT runs additional IR optimizations on it (see		the device JIT runs additional IR optimizations on it (see
:ref:`LIBOMPTARGET_JIT_OPT_LEVEL`). The value is expected to be a filename into		:ref:`LIBOMPTARGET_JIT_OPT_LEVEL`). The value is expected to be a filename into
which the LLVM-IR module is written. The module can then be analyzed,		which the LLVM-IR module is written. The module can then be analyzed,
transformed, and loaded back into the JIT pipeline via		transformed, and loaded back into the JIT pipeline via
:ref:`LIBOMPTARGET_JIT_REPLACEMENT_MODULE`.		:ref:`LIBOMPTARGET_JIT_REPLACEMENT_MODULE`.


		LIBOMPTARGET_MIN_THREADS_FOR_LOW_TRIP_COUNT
		"""""""""""""""""""""""""""""""""""""""""""

		This environment variable defines a lower bound for the number of threads if a
		combined kernel, e.g., ``target teams distribute parallel for``, has
		insufficient parallelism. In particular, if the trip count of the loops is
		lower than the number of threads possible times the number of teams (a.k.a.
		blocks) the device prefers (see also :ref:`LIBOMPTARGET_AMDGPU_TEAMS_PER_CU`),
		we will reduce the thread count to increase outer (team/block) parallelism.
		The thread count will never be reduced below the value passed for this
		environment variable, though.



.. _libomptarget_plugin:		.. _libomptarget_plugin:

LLVM/OpenMP Target Host Runtime Plugins (``libomptarget.rtl.XXXX``)		LLVM/OpenMP Target Host Runtime Plugins (``libomptarget.rtl.XXXX``)
-------------------------------------------------------------------		-------------------------------------------------------------------

The LLVM/OpenMP target host runtime plugins were recently re-implemented,		The LLVM/OpenMP target host runtime plugins were recently re-implemented,
temporarily renamed as the NextGen plugins, and set as the default and only		temporarily renamed as the NextGen plugins, and set as the default and only
[... 288 lines not shown ...]
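
As an illustration of the documented default, the following sketch reads an unsigned value from the environment and falls back to 32 when the variable is unset or malformed. It is not the runtime's implementation: the patch uses `UInt32Envar`, whose internals are not shown in this revision, and the helpers `readUInt32Env` and `minThreadsForLowTripCount` are hypothetical names built only on standard library calls.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <limits>

// Hypothetical stand-in for the runtime's envar handling: read an unsigned
// 32-bit value from the environment, falling back to a default when the
// variable is unset or is not a plain decimal number in range.
inline uint32_t readUInt32Env(const char *name, uint32_t defaultValue) {
  const char *text = std::getenv(name);
  if (!text || !std::isdigit(static_cast<unsigned char>(*text)))
    return defaultValue;
  errno = 0;
  char *end = nullptr;
  unsigned long long value = std::strtoull(text, &end, 10);
  if (errno != 0 || *end != '\0' ||
      value > std::numeric_limits<uint32_t>::max())
    return defaultValue;
  return static_cast<uint32_t>(value);
}

// The default of 32 is taken from the documentation entry in this revision.
inline uint32_t minThreadsForLowTripCount() {
  return readUInt32Env("LIBOMPTARGET_MIN_THREADS_FOR_LOW_TRIP_COUNT", 32);
}
```

Running an application with `LIBOMPTARGET_MIN_THREADS_FOR_LOW_TRIP_COUNT=8` in the environment would make this sketch return 8 instead of 32.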

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h

[... 307 lines not shown ...] (context: private:)
/// Get the default number of threads and blocks for the kernel.		/// Get the default number of threads and blocks for the kernel.
virtual uint32_t getDefaultNumThreads(GenericDeviceTy &Device) const = 0;		virtual uint32_t getDefaultNumThreads(GenericDeviceTy &Device) const = 0;
virtual uint32_t getDefaultNumBlocks(GenericDeviceTy &Device) const = 0;		virtual uint32_t getDefaultNumBlocks(GenericDeviceTy &Device) const = 0;

/// Get the number of threads and blocks for the kernel based on the		/// Get the number of threads and blocks for the kernel based on the
/// user-defined threads and block clauses.		/// user-defined threads and block clauses.
uint32_t getNumThreads(GenericDeviceTy &GenericDevice,		uint32_t getNumThreads(GenericDeviceTy &GenericDevice,
uint32_t ThreadLimitClause[3]) const;		uint32_t ThreadLimitClause[3]) const;

		/// The number of threads \p NumThreads can be adjusted by this method.
uint64_t getNumBlocks(GenericDeviceTy &GenericDevice,		uint64_t getNumBlocks(GenericDeviceTy &GenericDevice,
uint32_t BlockLimitClause[3], uint64_t LoopTripCount,		uint32_t BlockLimitClause[3], uint64_t LoopTripCount,
uint32_t NumThreads) const;		uint32_t &NumThreads) const;

/// Indicate if the kernel works in Generic SPMD, Generic or SPMD mode.		/// Indicate if the kernel works in Generic SPMD, Generic or SPMD mode.
bool isGenericSPMDMode() const {		bool isGenericSPMDMode() const {
return ExecutionMode == OMP_TGT_EXEC_MODE_GENERIC_SPMD;		return ExecutionMode == OMP_TGT_EXEC_MODE_GENERIC_SPMD;
}		}
bool isGenericMode() const {		bool isGenericMode() const {
return ExecutionMode == OMP_TGT_EXEC_MODE_GENERIC;		return ExecutionMode == OMP_TGT_EXEC_MODE_GENERIC;
}		}
[... 408 lines not shown ...] (context: struct GenericDeviceTy : public DeviceAllocatorTy {)
virtual std::string getComputeUnitKind() const { return "unknown"; }		virtual std::string getComputeUnitKind() const { return "unknown"; }

/// Post processing after jit backend. The ownership of \p MB will be taken.		/// Post processing after jit backend. The ownership of \p MB will be taken.
virtual Expected<std::unique_ptr<MemoryBuffer>>		virtual Expected<std::unique_ptr<MemoryBuffer>>
doJITPostProcessing(std::unique_ptr<MemoryBuffer> MB) const {		doJITPostProcessing(std::unique_ptr<MemoryBuffer> MB) const {
return std::move(MB);		return std::move(MB);
}		}

		/// The minimum number of threads we use for a low-trip count combined loop.
		/// Instead of using more threads we increase the outer (block/team)
		/// parallelism.
		/// @see OMPX_MinThreadsForLowTripCount
		virtual uint32_t getMinThreadsForLowTripCountLoop() {
		return OMPX_MinThreadsForLowTripCount;
		}

private:		private:
/// Register offload entry for global variable.		/// Register offload entry for global variable.
Error registerGlobalOffloadEntry(DeviceImageTy &DeviceImage,		Error registerGlobalOffloadEntry(DeviceImageTy &DeviceImage,
const __tgt_offload_entry &GlobalEntry,		const __tgt_offload_entry &GlobalEntry,
__tgt_offload_entry &DeviceEntry);		__tgt_offload_entry &DeviceEntry);

/// Register offload entry for kernel function.		/// Register offload entry for kernel function.
Error registerKernelOffloadEntry(DeviceImageTy &DeviceImage,		Error registerKernelOffloadEntry(DeviceImageTy &DeviceImage,
[... 27 lines not shown ...] (context: private:)
Int32Envar OMP_TeamsThreadLimit;		Int32Envar OMP_TeamsThreadLimit;

/// Environment variables defined by the LLVM OpenMP implementation.		/// Environment variables defined by the LLVM OpenMP implementation.
Int32Envar OMPX_DebugKind;		Int32Envar OMPX_DebugKind;
UInt32Envar OMPX_SharedMemorySize;		UInt32Envar OMPX_SharedMemorySize;
UInt64Envar OMPX_TargetStackSize;		UInt64Envar OMPX_TargetStackSize;
UInt64Envar OMPX_TargetHeapSize;		UInt64Envar OMPX_TargetHeapSize;

		/// Environment flag to set the minimum number of threads we use for a
		/// low-trip count combined loop. Instead of using more threads we increase
		/// the outer (block/team) parallelism.
		UInt32Envar OMPX_MinThreadsForLowTripCount =
		UInt32Envar("LIBOMPTARGET_MIN_THREADS_FOR_LOW_TRIP_COUNT", 32);

protected:		protected:
/// Return the execution mode used for kernel \p Name.		/// Return the execution mode used for kernel \p Name.
Expected<OMPTgtExecModeFlags> getExecutionModeForKernel(StringRef Name,		Expected<OMPTgtExecModeFlags> getExecutionModeForKernel(StringRef Name,
DeviceImageTy &Image);		DeviceImageTy &Image);

/// Environment variables defined by the LLVM OpenMP implementation		/// Environment variables defined by the LLVM OpenMP implementation
/// regarding the initial number of streams and events.		/// regarding the initial number of streams and events.
UInt32Envar OMPX_InitialNumStreams;		UInt32Envar OMPX_InitialNumStreams;
[... 407 lines not shown ...]

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp

[... 13 lines not shown ...]
#include "JIT.h"		#include "JIT.h"
#include "elf_common.h"		#include "elf_common.h"
#include "omptarget.h"		#include "omptarget.h"
#include "omptargetplugin.h"		#include "omptargetplugin.h"

#include "llvm/Frontend/OpenMP/OMPConstants.h"		#include "llvm/Frontend/OpenMP/OMPConstants.h"
#include "llvm/Support/Error.h"		#include "llvm/Support/Error.h"
#include "llvm/Support/JSON.h"		#include "llvm/Support/JSON.h"
		#include "llvm/Support/MathExtras.h"
#include "llvm/Support/MemoryBuffer.h"		#include "llvm/Support/MemoryBuffer.h"

#include <cstdint>		#include <cstdint>
#include <limits>		#include <limits>

using namespace llvm;		using namespace llvm;
using namespace omp;		using namespace omp;
using namespace target;		using namespace target;
[... 266 lines not shown ...] (context: uint32_t GenericKernelTy::getNumThreads(GenericDeviceTy &GenericDevice,)
return std::min(MaxNumThreads, (ThreadLimitClause[0] > 0)		return std::min(MaxNumThreads, (ThreadLimitClause[0] > 0)
? ThreadLimitClause[0]		? ThreadLimitClause[0]
: PreferredNumThreads);		: PreferredNumThreads);
}		}

uint64_t GenericKernelTy::getNumBlocks(GenericDeviceTy &GenericDevice,		uint64_t GenericKernelTy::getNumBlocks(GenericDeviceTy &GenericDevice,
uint32_t NumTeamsClause[3],		uint32_t NumTeamsClause[3],
uint64_t LoopTripCount,		uint64_t LoopTripCount,
uint32_t NumThreads) const {		uint32_t &NumThreads) const {
assert(NumTeamsClause[1] == 0 && NumTeamsClause[2] == 0 &&		assert(NumTeamsClause[1] == 0 && NumTeamsClause[2] == 0 &&
"Multi dimensional launch not supported yet.");		"Multi dimensional launch not supported yet.");

if (NumTeamsClause[0] > 0) {		if (NumTeamsClause[0] > 0) {
// TODO: We need to honor any value and consequently allow more than the		// TODO: We need to honor any value and consequently allow more than the
// block limit. For this we might need to start multiple kernels or let the		// block limit. For this we might need to start multiple kernels or let the
// blocks start again until the requested number has been started.		// blocks start again until the requested number has been started.
return std::min(NumTeamsClause[0], GenericDevice.getBlockLimit());		return std::min(NumTeamsClause[0], GenericDevice.getBlockLimit());
}		}

		uint64_t DefaultNumBlocks = getDefaultNumBlocks(GenericDevice);
uint64_t TripCountNumBlocks = std::numeric_limits<uint64_t>::max();		uint64_t TripCountNumBlocks = std::numeric_limits<uint64_t>::max();
if (LoopTripCount > 0) {		if (LoopTripCount > 0) {
if (isSPMDMode()) {		if (isSPMDMode()) {
// We have a combined construct, i.e. `target teams distribute		// We have a combined construct, i.e. `target teams distribute
// parallel for [simd]`. We launch so many teams so that each thread		// parallel for [simd]`. We launch so many teams so that each thread
// will execute one iteration of the loop. round up to the nearest		// will execute one iteration of the loop; rounded up to the nearest
// integer		// integer. However, if that results in too few teams, we artificially
		// reduce the thread count per team to increase the outer parallelism.
		auto MinThreads = GenericDevice.getMinThreadsForLowTripCountLoop();
		MinThreads = std::min(MinThreads, NumThreads);

		// Honor the thread_limit clause; only lower the number of threads.
		auto OldNumThreads = NumThreads;
		if (LoopTripCount >= DefaultNumBlocks * NumThreads) {
		// Enough parallelism for teams and threads.
TripCountNumBlocks = ((LoopTripCount - 1) / NumThreads) + 1;		TripCountNumBlocks = ((LoopTripCount - 1) / NumThreads) + 1;
		assert(TripCountNumBlocks >= DefaultNumBlocks &&
		"Expected sufficient outer parallelism.");
		} else if (LoopTripCount >= DefaultNumBlocks * MinThreads) {
		// Enough parallelism for teams, limit threads.

		// This case is hard; for now, we force "full warps":
		// First, compute a thread count assuming DefaultNumBlocks.
		auto NumThreadsDefaultBlocks =
		(LoopTripCount + DefaultNumBlocks - 1) / DefaultNumBlocks;
		// Now get a power of two that is larger or equal.
		auto NumThreadsDefaultBlocksP2 =
		llvm::PowerOf2Ceil(NumThreadsDefaultBlocks);
		// Do not increase a thread limit given by the user.
		NumThreads = std::min(NumThreads, uint32_t(NumThreadsDefaultBlocksP2));
		assert(NumThreads >= MinThreads &&
		"Expected sufficient inner parallelism.");
		TripCountNumBlocks = ((LoopTripCount - 1) / NumThreads) + 1;
		} else {
		// Not enough parallelism for teams and threads, limit both.
		NumThreads = std::min(NumThreads, MinThreads);
		TripCountNumBlocks = ((LoopTripCount - 1) / NumThreads) + 1;
		}

		assert(NumThreads * TripCountNumBlocks >= LoopTripCount &&
		"Expected sufficient parallelism");
		assert(OldNumThreads >= NumThreads &&
		"Number of threads cannot be increased!");
} else {		} else {
assert((isGenericMode() || isGenericSPMDMode()) &&		assert((isGenericMode() || isGenericSPMDMode()) &&
"Unexpected execution mode!");		"Unexpected execution mode!");
// If we reach this point, then we have a non-combined construct, i.e.		// If we reach this point, then we have a non-combined construct, i.e.
// `teams distribute` with a nested `parallel for` and each team is		// `teams distribute` with a nested `parallel for` and each team is
// assigned one iteration of the `distribute` loop. E.g.:		// assigned one iteration of the `distribute` loop. E.g.:
//		//
// #pragma omp target teams distribute		// #pragma omp target teams distribute
// for(...loop_tripcount...) {		// for(...loop_tripcount...) {
// #pragma omp parallel for		// #pragma omp parallel for
// for(...) {}		// for(...) {}
// }		// }
//		//
// Threads within a team will execute the iterations of the `parallel`		// Threads within a team will execute the iterations of the `parallel`
// loop.		// loop.
TripCountNumBlocks = LoopTripCount;		TripCountNumBlocks = LoopTripCount;
}		}
}		}
// If the loops are long running we rather reuse blocks than spawn too many.		// If the loops are long running we rather reuse blocks than spawn too many.
uint32_t PreferredNumBlocks = std::min(uint32_t(TripCountNumBlocks),		uint32_t PreferredNumBlocks = std::min(TripCountNumBlocks, DefaultNumBlocks);
getDefaultNumBlocks(GenericDevice));
return std::min(PreferredNumBlocks, GenericDevice.getBlockLimit());		return std::min(PreferredNumBlocks, GenericDevice.getBlockLimit());
}		}

GenericDeviceTy::GenericDeviceTy(int32_t DeviceId, int32_t NumDevices,		GenericDeviceTy::GenericDeviceTy(int32_t DeviceId, int32_t NumDevices,
const llvm::omp::GV &OMPGridValues)		const llvm::omp::GV &OMPGridValues)
: MemoryManager(nullptr), OMP_TeamLimit("OMP_TEAM_LIMIT"),		: MemoryManager(nullptr), OMP_TeamLimit("OMP_TEAM_LIMIT"),
OMP_NumTeams("OMP_NUM_TEAMS"),		OMP_NumTeams("OMP_NUM_TEAMS"),
OMP_TeamsThreadLimit("OMP_TEAMS_THREAD_LIMIT"),		OMP_TeamsThreadLimit("OMP_TEAMS_THREAD_LIMIT"),
[... 1,169 lines not shown ...]
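
The three-way case split in `getNumBlocks` can be tried out in isolation with the standalone sketch below. It is a simplified model under stated assumptions, not the committed code: it ignores the `num_teams` clause and the device block limit, replaces `llvm::PowerOf2Ceil` with a local helper, and takes the default block count and minimum thread count as plain parameters. The names `LaunchConfig`, `powerOf2Ceil`, and `chooseLaunchConfig` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Launch configuration chosen for a combined (SPMD) loop.
struct LaunchConfig {
  uint64_t NumBlocks;
  uint32_t NumThreads;
};

// Smallest power of two >= Value (local stand-in for llvm::PowerOf2Ceil).
// Assumes Value <= 2^63 so the shift cannot overflow.
inline uint64_t powerOf2Ceil(uint64_t Value) {
  uint64_t Result = 1;
  while (Result < Value)
    Result <<= 1;
  return Result;
}

// Sketch of the case split. Assumes LoopTripCount, NumThreads,
// DefaultNumBlocks and MinThreads are all greater than zero.
inline LaunchConfig chooseLaunchConfig(uint64_t LoopTripCount,
                                       uint32_t NumThreads,
                                       uint64_t DefaultNumBlocks,
                                       uint32_t MinThreads) {
  // Never exceed the incoming thread count (e.g. from a thread_limit clause).
  MinThreads = std::min(MinThreads, NumThreads);

  if (LoopTripCount >= DefaultNumBlocks * NumThreads) {
    // Enough work for all blocks at the full thread count: keep NumThreads.
  } else if (LoopTripCount >= DefaultNumBlocks * MinThreads) {
    // Enough work for all blocks only with fewer threads: pick the power of
    // two that covers the loop when DefaultNumBlocks blocks are used.
    uint64_t PerBlock =
        (LoopTripCount + DefaultNumBlocks - 1) / DefaultNumBlocks;
    NumThreads = static_cast<uint32_t>(
        std::min<uint64_t>(NumThreads, powerOf2Ceil(PerBlock)));
  } else {
    // Too little work even at the minimum: use the minimum thread count.
    NumThreads = MinThreads;
  }

  uint64_t TripCountNumBlocks = ((LoopTripCount - 1) / NumThreads) + 1;
  // Long-running loops reuse blocks instead of exceeding the default count.
  return {std::min(TripCountNumBlocks, DefaultNumBlocks), NumThreads};
}
```

With an assumed default of 64 blocks and 128 threads, a 128-iteration loop yields 4 blocks of 32 threads at a minimum of 32 and 16 blocks of 8 threads at a minimum of 8, in line with the expectations in small_trip_count.c. Real devices report their own defaults, so other inputs may land in a different case.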

openmp/libomptarget/test/offloading/small_trip_count.c

This file was added.

				// clang-format off
				// RUN: %libomptarget-compile-generic
				// RUN: env LIBOMPTARGET_INFO=16 \
				// RUN: %libomptarget-run-generic 2>&1 | %fcheck-generic --check-prefix=DEFAULT
				// RUN: env LIBOMPTARGET_INFO=16 LIBOMPTARGET_MIN_THREADS_FOR_LOW_TRIP_COUNT=8 \
				// RUN: %libomptarget-run-generic 2>&1 | %fcheck-generic --check-prefix=EIGHT

				// UNSUPPORTED: x86_64-pc-linux-gnu
				// UNSUPPORTED: x86_64-pc-linux-gnu-LTO

				#define N 128

				__attribute__((optnone)) void optnone() {}

				int main() {
				// DEFAULT: Launching kernel {{.+_main_.+}} with 4 blocks and 32 threads in SPMD mode
				// EIGHT: Launching kernel {{.+_main_.+}} with 16 blocks and 8 threads in SPMD mode
				#pragma omp target teams distribute parallel for simd
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				// DEFAULT: Launching kernel {{.+_main_.+}} with 4 blocks and 32 threads in SPMD mode
				// EIGHT: Launching kernel {{.+_main_.+}} with 16 blocks and 8 threads in SPMD mode
				#pragma omp target teams distribute parallel for simd
				for (int i = 0; i < N - 1; ++i) {
				optnone();
				}
				// DEFAULT: Launching kernel {{.+_main_.+}} with 5 blocks and 32 threads in SPMD mode
				// EIGHT: Launching kernel {{.+_main_.+}} with 17 blocks and 8 threads in SPMD mode
				#pragma omp target teams distribute parallel for simd
				for (int i = 0; i < N + 1; ++i) {
				optnone();
				}
				// DEFAULT: Launching kernel {{.+_main_.+}} with 32 blocks and 4 threads in SPMD mode
				// EIGHT: Launching kernel {{.+_main_.+}} with 32 blocks and 4 threads in SPMD mode
				#pragma omp target teams distribute parallel for simd thread_limit(4)
				for (int i = 0; i < N; ++i) {
				optnone();
				}
				}