This is an archive of the discontinued LLVM Phabricator instance.

Differential D98678

[OpenMP][DeviceRT] Remove eager allocation for dynamic schedule handling
AcceptedPublic

Authored by jdoerfert on Mar 15 2021, 8:32 PM.

Details

Reviewers

JonChesterfield
tianshilei1992
grokos
ye-luo
bollu
ronlieb

Summary

This removes roughly 20% of the memory allocated statically by the
runtime (1010592744 vs 801186888 bytes). It will make loops without a
statically known static schedule more expensive (for the one-time
allocation and free). It will make new data environments, e.g., nested
parallel regions, cheaper, as we don't save/restore the "loop data" anymore.
Overall, it's a trade-off which makes sure you only pay for what you
use.
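
The "pay for what you use" idea can be sketched on the host as a per-thread linked stack of loop trackers: nothing is allocated until a dynamically scheduled loop starts, and a nested loop pushes a second tracker on top of the first. This is only an illustration under assumptions. `Tracker`, `pushTracker`, `peekTracker`, and `popTracker` are hypothetical names, and plain `new`/`delete` stand in for the device runtime's `SafeMalloc`/`SafeFree`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for the per-loop scheduling state.
struct Tracker {
  int64_t chunk = 0;
  int64_t loopUpperBound = 0;
  int64_t nextLowerBound = 0;
  int64_t stride = 0;
  Tracker *next = nullptr; // tracker of the enclosing loop, if any
};

// One stack head per thread; empty until a loop actually needs a tracker.
static thread_local Tracker *currentTracker = nullptr;

// Allocate a tracker for a newly started loop and make it the current one.
Tracker *pushTracker() {
  Tracker *t = new Tracker();
  t->next = currentTracker;
  currentTracker = t;
  return t;
}

// Return the tracker of the innermost active loop (nullptr if none).
Tracker *peekTracker() { return currentTracker; }

// Free the innermost tracker and restore the enclosing loop's tracker.
void popTracker() {
  Tracker *outer = currentTracker->next;
  delete currentTracker;
  currentTracker = outer;
}
```

A loop with a statically known static schedule never calls `pushTracker`, so it pays nothing; every other loop pays one allocation and one free.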

A better solution would be to extend the _dispatch_ API such that we
can pass in "stack-allocations" to hold the data. Since that requires
more work this seemed like a good first step.
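
A rough sketch of what such an extended API could look like, assuming the caller owns the loop state and passes it by reference. `LoopState`, `dispatchInit`, `dispatchNext`, and `sumIterations` are hypothetical names, not part of the existing `__kmpc_dispatch_*` interface, and only a static-chunk schedule is modeled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical loop state owned by the caller; not the runtime's real layout.
struct LoopState {
  int64_t chunk;
  int64_t upperBound;
  int64_t nextLowerBound;
  int64_t stride;
};

// Hypothetical init: record a static-chunk schedule for one thread.
void dispatchInit(LoopState &s, int64_t lb, int64_t ub, int64_t chunk,
                  int threadId, int numThreads) {
  s.chunk = chunk;
  s.upperBound = ub;
  s.nextLowerBound = lb + threadId * chunk; // first chunk of this thread
  s.stride = chunk * numThreads;            // distance to its next chunk
}

// Hypothetical next: yield the inclusive range [lower, upper], or false
// once this thread has no chunks left.
bool dispatchNext(LoopState &s, int64_t &lower, int64_t &upper) {
  if (s.nextLowerBound > s.upperBound)
    return false;
  lower = s.nextLowerBound;
  upper = lower + s.chunk - 1;
  if (upper > s.upperBound)
    upper = s.upperBound;
  s.nextLowerBound = lower + s.stride;
  return true;
}

// Sum the iteration indices one thread executes. The loop state lives on
// the caller's stack, so no malloc/free is needed.
int64_t sumIterations(int64_t lb, int64_t ub, int64_t chunk, int threadId,
                      int numThreads) {
  LoopState state;
  dispatchInit(state, lb, ub, chunk, threadId, numThreads);
  int64_t sum = 0, lower = 0, upper = 0;
  while (dispatchNext(state, lower, upper))
    for (int64_t i = lower; i <= upper; ++i)
      sum += i;
  return sum;
}
```

With iterations 0..9, a chunk of 2, and two threads, thread 0 gets the chunks starting at 0, 4, and 8, and thread 1 those starting at 2 and 6. The compiler-generated call sites would have to reserve the `LoopState` storage, which is the extra work mentioned above.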

Probably needs more testing.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jdoerfert created this revision. Mar 15 2021, 8:32 PM

Herald added a reviewer: bollu. Mar 15 2021, 8:32 PM

Herald added subscribers: guansong, yaxunl.

jdoerfert requested review of this revision. Mar 15 2021, 8:32 PM

Herald added a project: Restricted Project. Mar 15 2021, 8:32 PM

Herald added a subscriber: sstefan1.

Move type *before* the template class

Harbormaster completed remote builds in B93967: Diff 330869. Mar 15 2021, 9:06 PM

Harbormaster completed remote builds in B93969: Diff 330871. Mar 15 2021, 9:24 PM

jdoerfert mentioned this in D98713: [OpenMP][WIP] Move run-sched-var ICV to ICVStateTy to make TaskDescr obsolete. Mar 16 2021, 8:00 AM

ronlieb added a subscriber: ronlieb. Mar 16 2021, 10:00 AM

The memory saving is real on my side as well: 3074MB -> 2534MB. I have been using this for a few days; no problems found so far.
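
As a quick sanity check, the two reported savings can be compared as relative reductions. The helper below is a hypothetical one-liner, not part of the patch; applied to the numbers quoted in this revision it gives about 20.7% for the summary's byte counts and about 17.6% for the 3074MB -> 2534MB measurement.

```cpp
#include <cassert>
#include <cstdint>

// Relative reduction, in percent, when going from `before` to `after`.
double reductionPercent(double before, double after) {
  return 100.0 * (before - after) / before;
}
```

The two figures measure different things (statically allocated runtime memory versus an application's overall footprint), so they are not expected to match exactly.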

This revision is now accepted and ready to land. Mar 18 2021, 4:40 PM

I failed to apply this to amd-stg-open to test, but may be able to run qmcpack against llvm main now. Will try that.

In D98678#2636197, @JonChesterfield wrote:

I failed to apply this to amd-stg-open to test, but may be able to run qmcpack against llvm main now. Will try that.

What turns out in your try?

In D98678#2641938, @ye-luo wrote:

In D98678#2636197, @JonChesterfield wrote:

I failed to apply this to amd-stg-open to test, but may be able to run qmcpack against llvm main now. Will try that.

What turns out in your try?

qmcpack fails to build with trunk llvm on amdgcn, before applying this patch. Work in progress...

In D98678#2641994, @JonChesterfield wrote:

In D98678#2641938, @ye-luo wrote:

In D98678#2636197, @JonChesterfield wrote:

I failed to apply this to amd-stg-open to test, but may be able to run qmcpack against llvm main now. Will try that.

What turns out in your try?

qmcpack fails to build with trunk llvm on amdgcn, before applying this patch. Work in progress...

Is it OK to commit this patch to trunk now? It sounds like your testing is stuck on another set of issues.

I'm not blocking. It doesn't merge cleanly into rocm so it's hard to run against aomp's testing.

The patch was written for qmcpack so it's not surprising it works there. @jdoerfert have you tried other applications?

Adding Ron in case he can run this against the amd-stg-open branch, and as forewarning of merge conflicts if not.

In D98678#2642158, @JonChesterfield wrote:

The patch was written for qmcpack so it's not surprising it works there. @jdoerfert have you tried other applications?

I did run it with miniqmc only. I don't assume any user of a statically known static schedule to be affected by this. And I do not know of applications that don't match that.

@ronlieb @JonChesterfield Can you run it (=trunk + this) against your tests?

Revision Contents

Path

Size

openmp/

libomptarget/

deviceRTLs/

common/

allocator.h

6 lines

omptarget.h

24 lines

omptargeti.h

24 lines

src/

loop.cu

151 lines

parallel.cu

2 lines

Diff 330871

openmp/libomptarget/deviceRTLs/common/allocator.h

	Show All 33 Lines

	#define SHARED(NAME) \			#define SHARED(NAME) \
	NAME [[clang::loader_uninitialized]]; \			NAME [[clang::loader_uninitialized]]; \
	OMP_PRAGMA(allocate(NAME) allocator(omp_pteam_mem_alloc))			OMP_PRAGMA(allocate(NAME) allocator(omp_pteam_mem_alloc))

	#define EXTERN_SHARED(NAME) \			#define EXTERN_SHARED(NAME) \
	NAME; \			NAME; \
	OMP_PRAGMA(allocate(NAME) allocator(omp_pteam_mem_alloc))			OMP_PRAGMA(allocate(NAME) allocator(omp_pteam_mem_alloc))

				// TODO: clang should use address space 5 for omp_thread_mem_alloc, but right
				// now that's not the case.
				#define THREAD_LOCAL(NAME) \
				NAME [[clang::loader_uninitialized, clang::address_space(5)]]

	#endif			#endif

	#endif // OMPTARGET_ALLOCATOR_H			#endif // OMPTARGET_ALLOCATOR_H

openmp/libomptarget/deviceRTLs/common/omptarget.h

//===---- omptarget.h - OpenMP GPU initialization ---------------- CUDA -*-===//		//===---- omptarget.h - OpenMP GPU initialization ---------------- CUDA -*-===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// This file contains the declarations of all library macros, types,		// This file contains the declarations of all library macros, types,
// and functions.		// and functions.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef OMPTARGET_H		#ifndef OMPTARGET_H
#define OMPTARGET_H		#define OMPTARGET_H

#include "common/allocator.h"		#include "common/allocator.h"
		Lint: Pre-merge checks: clang-tidy: error: 'common/allocator.h' file not found [clang-diagnostic-error]
#include "common/debug.h" // debug		#include "common/debug.h" // debug
#include "common/state-queue.h"		#include "common/state-queue.h"
#include "common/support.h"		#include "common/support.h"
#include "interface.h" // interfaces with omp, compiler, and user		#include "interface.h" // interfaces with omp, compiler, and user
#include "target_impl.h"		#include "target_impl.h"

#define OMPTARGET_NVPTX_VERSION 1.1		#define OMPTARGET_NVPTX_VERSION 1.1

▲ Show 20 Lines • Show All 101 Lines • ▼ Show 20 Lines	public:
INLINE void Copy(omptarget_nvptx_TaskDescr *sourceTaskDescr);		INLINE void Copy(omptarget_nvptx_TaskDescr *sourceTaskDescr);
INLINE void CopyData(omptarget_nvptx_TaskDescr *sourceTaskDescr);		INLINE void CopyData(omptarget_nvptx_TaskDescr *sourceTaskDescr);
INLINE void CopyParent(omptarget_nvptx_TaskDescr *parentTaskDescr);		INLINE void CopyParent(omptarget_nvptx_TaskDescr *parentTaskDescr);
INLINE void CopyForExplicitTask(omptarget_nvptx_TaskDescr *parentTaskDescr);		INLINE void CopyForExplicitTask(omptarget_nvptx_TaskDescr *parentTaskDescr);
INLINE void CopyToWorkDescr(omptarget_nvptx_TaskDescr *masterTaskDescr);		INLINE void CopyToWorkDescr(omptarget_nvptx_TaskDescr *masterTaskDescr);
INLINE void CopyFromWorkDescr(omptarget_nvptx_TaskDescr *workTaskDescr);		INLINE void CopyFromWorkDescr(omptarget_nvptx_TaskDescr *workTaskDescr);
INLINE void CopyConvergentParent(omptarget_nvptx_TaskDescr *parentTaskDescr,		INLINE void CopyConvergentParent(omptarget_nvptx_TaskDescr *parentTaskDescr,
uint16_t tid, uint16_t tnum);		uint16_t tid, uint16_t tnum);
INLINE void SaveLoopData();
INLINE void RestoreLoopData() const;

private:		private:
// bits for flags: (6 used, 2 free)		// bits for flags: (6 used, 2 free)
// 3 bits (SchedMask) for runtime schedule		// 3 bits (SchedMask) for runtime schedule
// 1 bit (InPar) if this thread has encountered one or more parallel region		// 1 bit (InPar) if this thread has encountered one or more parallel region
// 1 bit (IsParConstr) if ICV for a parallel region (false = explicit task)		// 1 bit (IsParConstr) if ICV for a parallel region (false = explicit task)
// 1 bit (InParL2+) if this thread has encountered L2 or higher parallel		// 1 bit (InParL2+) if this thread has encountered L2 or higher parallel
// region		// region
static const uint8_t TaskDescr_SchedMask = (0x1 | 0x2 | 0x4);		static const uint8_t TaskDescr_SchedMask = (0x1 | 0x2 | 0x4);
static const uint8_t TaskDescr_InPar = 0x10;		static const uint8_t TaskDescr_InPar = 0x10;
static const uint8_t TaskDescr_IsParConstr = 0x20;		static const uint8_t TaskDescr_IsParConstr = 0x20;
static const uint8_t TaskDescr_InParL2P = 0x40;		static const uint8_t TaskDescr_InParL2P = 0x40;

struct SavedLoopDescr_items {
int64_t loopUpperBound;
int64_t nextLowerBound;
int64_t chunk;
int64_t stride;
kmp_sched_t schedule;
} loopData;

struct TaskDescr_items {		struct TaskDescr_items {
uint8_t flags; // 6 bit used (see flag above)		uint8_t flags; // 6 bit used (see flag above)
uint8_t unused;		uint8_t unused;
uint16_t threadId; // thread id		uint16_t threadId; // thread id
uint64_t runtimeChunkSize; // runtime chunk size		uint64_t runtimeChunkSize; // runtime chunk size
} items;		} items;
omptarget_nvptx_TaskDescr *prev;		omptarget_nvptx_TaskDescr *prev;
};		};
▲ Show 20 Lines • Show All 52 Lines • ▼ Show 20 Lines	private:
ALIGN(16)		ALIGN(16)
__kmpc_data_sharing_slot worker_rootS[DS_Max_Warp_Number];		__kmpc_data_sharing_slot worker_rootS[DS_Max_Warp_Number];
};		};

////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
// thread private data (struct of arrays for better coalescing)		// thread private data (struct of arrays for better coalescing)
// tid refers here to the global thread id		// tid refers here to the global thread id
// do not support multiple concurrent kernels at this time		// do not support multiple concurrent kernels at this time

class omptarget_nvptx_ThreadPrivateContext {		class omptarget_nvptx_ThreadPrivateContext {
public:		public:
// task		// task
INLINE omptarget_nvptx_TaskDescr *Level1TaskDescr(int tid) {		INLINE omptarget_nvptx_TaskDescr *Level1TaskDescr(int tid) {
return &levelOneTaskDescr[tid];		return &levelOneTaskDescr[tid];
}		}
INLINE void SetTopLevelTaskDescr(int tid,		INLINE void SetTopLevelTaskDescr(int tid,
omptarget_nvptx_TaskDescr *taskICV) {		omptarget_nvptx_TaskDescr *taskICV) {
topTaskDescr[tid] = taskICV;		topTaskDescr[tid] = taskICV;
}		}
INLINE omptarget_nvptx_TaskDescr *GetTopLevelTaskDescr(int tid) const;		INLINE omptarget_nvptx_TaskDescr *GetTopLevelTaskDescr(int tid) const;
// parallel		// parallel
INLINE uint16_t &NumThreadsForNextParallel(int tid) {		INLINE uint16_t &NumThreadsForNextParallel(int tid) {
return nextRegion.tnum[tid];		return nextRegion.tnum[tid];
}		}
// schedule (for dispatch)
INLINE kmp_sched_t &ScheduleType(int tid) { return schedule[tid]; }
INLINE int64_t &Chunk(int tid) { return chunk[tid]; }
INLINE int64_t &LoopUpperBound(int tid) { return loopUpperBound[tid]; }
INLINE int64_t &NextLowerBound(int tid) { return nextLowerBound[tid]; }
INLINE int64_t &Stride(int tid) { return stride[tid]; }

INLINE omptarget_nvptx_TeamDescr &TeamContext() { return teamContext; }		INLINE omptarget_nvptx_TeamDescr &TeamContext() { return teamContext; }

INLINE void InitThreadPrivateContext(int tid);		INLINE void InitThreadPrivateContext(int tid);
INLINE uint64_t &Cnt() { return cnt; }		INLINE uint64_t &Cnt() { return cnt; }

private:		private:
// team context for this team		// team context for this team
omptarget_nvptx_TeamDescr teamContext;		omptarget_nvptx_TeamDescr teamContext;
// task ICV for implicit threads in the only parallel region		// task ICV for implicit threads in the only parallel region
omptarget_nvptx_TaskDescr levelOneTaskDescr[MAX_THREADS_PER_TEAM];		omptarget_nvptx_TaskDescr levelOneTaskDescr[MAX_THREADS_PER_TEAM];
// pointer where to find the current task ICV (top of the stack)		// pointer where to find the current task ICV (top of the stack)
omptarget_nvptx_TaskDescr *topTaskDescr[MAX_THREADS_PER_TEAM];		omptarget_nvptx_TaskDescr *topTaskDescr[MAX_THREADS_PER_TEAM];
union {		union {
// Only one of the two is live at the same time.		// Only one of the two is live at the same time.
// parallel		// parallel
uint16_t tnum[MAX_THREADS_PER_TEAM];		uint16_t tnum[MAX_THREADS_PER_TEAM];
} nextRegion;		} nextRegion;
// schedule (for dispatch)		// schedule (for dispatch)
kmp_sched_t schedule[MAX_THREADS_PER_TEAM]; // remember schedule type for #for
int64_t chunk[MAX_THREADS_PER_TEAM];
int64_t loopUpperBound[MAX_THREADS_PER_TEAM];
// state for dispatch with dyn/guided OR static (never use both at a time)
int64_t nextLowerBound[MAX_THREADS_PER_TEAM];
int64_t stride[MAX_THREADS_PER_TEAM];
uint64_t cnt;		uint64_t cnt;
};		};

/// Memory manager for statically allocated memory.		/// Memory manager for statically allocated memory.
class omptarget_nvptx_SimpleMemoryManager {		class omptarget_nvptx_SimpleMemoryManager {
private:		private:
struct MemDataTy {		struct MemDataTy {
volatile unsigned keys[OMP_STATE_COUNT];		volatile unsigned keys[OMP_STATE_COUNT];
▲ Show 20 Lines • Show All 68 Lines • Show Last 20 Lines

openmp/libomptarget/deviceRTLs/common/omptargeti.h

	Show First 20 Lines • Show All 110 Lines • ▼ Show 20 Lines

	INLINE void omptarget_nvptx_TaskDescr::CopyConvergentParent(			INLINE void omptarget_nvptx_TaskDescr::CopyConvergentParent(
	omptarget_nvptx_TaskDescr *parentTaskDescr, uint16_t tid, uint16_t tnum) {			omptarget_nvptx_TaskDescr *parentTaskDescr, uint16_t tid, uint16_t tnum) {
	CopyParent(parentTaskDescr);			CopyParent(parentTaskDescr);
	items.flags |= TaskDescr_InParL2P; // In L2+ parallelism			items.flags |= TaskDescr_InParL2P; // In L2+ parallelism
	items.threadId = tid;			items.threadId = tid;
	}			}

	INLINE void omptarget_nvptx_TaskDescr::SaveLoopData() {
	loopData.loopUpperBound =
	omptarget_nvptx_threadPrivateContext->LoopUpperBound(items.threadId);
	loopData.nextLowerBound =
	omptarget_nvptx_threadPrivateContext->NextLowerBound(items.threadId);
	loopData.schedule =
	omptarget_nvptx_threadPrivateContext->ScheduleType(items.threadId);
	loopData.chunk = omptarget_nvptx_threadPrivateContext->Chunk(items.threadId);
	loopData.stride =
	omptarget_nvptx_threadPrivateContext->Stride(items.threadId);
	}

	INLINE void omptarget_nvptx_TaskDescr::RestoreLoopData() const {
	omptarget_nvptx_threadPrivateContext->Chunk(items.threadId) = loopData.chunk;
	omptarget_nvptx_threadPrivateContext->LoopUpperBound(items.threadId) =
	loopData.loopUpperBound;
	omptarget_nvptx_threadPrivateContext->NextLowerBound(items.threadId) =
	loopData.nextLowerBound;
	omptarget_nvptx_threadPrivateContext->Stride(items.threadId) =
	loopData.stride;
	omptarget_nvptx_threadPrivateContext->ScheduleType(items.threadId) =
	loopData.schedule;
	}

	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////
	// Thread Private Context			// Thread Private Context
	////////////////////////////////////////////////////////////////////////////////			////////////////////////////////////////////////////////////////////////////////

	INLINE omptarget_nvptx_TaskDescr *			INLINE omptarget_nvptx_TaskDescr *
	omptarget_nvptx_ThreadPrivateContext::GetTopLevelTaskDescr(int tid) const {			omptarget_nvptx_ThreadPrivateContext::GetTopLevelTaskDescr(int tid) const {
	ASSERT0(			ASSERT0(
	LT_FUSSY, tid < MAX_THREADS_PER_TEAM,			LT_FUSSY, tid < MAX_THREADS_PER_TEAM,
	▲ Show 20 Lines • Show All 74 Lines • Show Last 20 Lines

openmp/libomptarget/deviceRTLs/common/src/loop.cu

Show All 11 Lines
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
#pragma omp declare target		#pragma omp declare target

#include "common/omptarget.h"		#include "common/omptarget.h"
#include "target/shuffle.h"		#include "target/shuffle.h"
#include "target_impl.h"		#include "target_impl.h"

		struct DynamicScheduleTracker {
		int64_t Chunk;
		int64_t LoopUpperBound;
		int64_t NextLowerBound;
		int64_t Stride;
		kmp_sched_t ScheduleType;
		DynamicScheduleTracker *NextDST;
		};

////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
// template class that encapsulate all the helper functions		// template class that encapsulate all the helper functions
//		//
// T is loop iteration type (32 | 64) (unsigned | signed)		// T is loop iteration type (32 | 64) (unsigned | signed)
// ST is the signed version of T		// ST is the signed version of T
////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
▲ Show 20 Lines • Show All 170 Lines • ▼ Show 20 Lines	public:

INLINE static int OrderedSchedule(kmp_sched_t schedule) {		INLINE static int OrderedSchedule(kmp_sched_t schedule) {
return schedule >= kmp_sched_ordered_first &&		return schedule >= kmp_sched_ordered_first &&
schedule <= kmp_sched_ordered_last;		schedule <= kmp_sched_ordered_last;
}		}

INLINE static void dispatch_init(kmp_Ident *loc, int32_t threadId,		INLINE static void dispatch_init(kmp_Ident *loc, int32_t threadId,
kmp_sched_t schedule, T lb, T ub, ST st,		kmp_sched_t schedule, T lb, T ub, ST st,
ST chunk) {		ST chunk, DynamicScheduleTracker *DST) {
if (checkRuntimeUninitialized(loc)) {		if (checkRuntimeUninitialized(loc)) {
// In SPMD mode no need to check parallelism level - dynamic scheduling		// In SPMD mode no need to check parallelism level - dynamic scheduling
// may appear only in L2 parallel regions with lightweight runtime.		// may appear only in L2 parallel regions with lightweight runtime.
ASSERT0(LT_FUSSY, checkSPMDMode(loc), "Expected non-SPMD mode.");		ASSERT0(LT_FUSSY, checkSPMDMode(loc), "Expected non-SPMD mode.");
return;		return;
}		}
int tid = GetLogicalThreadIdInBlock(checkSPMDMode(loc));		int tid = GetLogicalThreadIdInBlock(checkSPMDMode(loc));
omptarget_nvptx_TaskDescr *currTaskDescr = getMyTopTaskDescriptor(tid);		omptarget_nvptx_TaskDescr *currTaskDescr = getMyTopTaskDescriptor(tid);
▲ Show 20 Lines • Show All 59 Lines • ▼ Show 20 Lines	if (tnum == 1 || tripCount <= 1 || OrderedSchedule(schedule)) {
"unknown schedule %d & chunk %lld\n", (int)schedule,		"unknown schedule %d & chunk %lld\n", (int)schedule,
(long long)chunk);		(long long)chunk);
}		}

// init schedules		// init schedules
if (schedule == kmp_sched_static_chunk) {		if (schedule == kmp_sched_static_chunk) {
ASSERT0(LT_FUSSY, chunk > 0, "bad chunk value");		ASSERT0(LT_FUSSY, chunk > 0, "bad chunk value");
// save sched state		// save sched state
omptarget_nvptx_threadPrivateContext->ScheduleType(tid) = schedule;		DST->ScheduleType = schedule;
// save ub		// save ub
omptarget_nvptx_threadPrivateContext->LoopUpperBound(tid) = ub;		DST->LoopUpperBound = ub;
// compute static chunk		// compute static chunk
ST stride;		ST stride;
int lastiter = 0;		int lastiter = 0;
ForStaticChunk(lastiter, lb, ub, stride, chunk, threadId, tnum);		ForStaticChunk(lastiter, lb, ub, stride, chunk, threadId, tnum);
// save computed params		// save computed params
omptarget_nvptx_threadPrivateContext->Chunk(tid) = chunk;		DST->Chunk = chunk;
omptarget_nvptx_threadPrivateContext->NextLowerBound(tid) = lb;		DST->NextLowerBound = lb;
omptarget_nvptx_threadPrivateContext->Stride(tid) = stride;		DST->Stride = stride;
PRINT(LD_LOOP,		PRINT(LD_LOOP,
"dispatch init (static chunk) : num threads = %d, ub = %" PRId64		"dispatch init (static chunk) : num threads = %d, ub = %" PRId64
", next lower bound = %llu, stride = %llu\n",		", next lower bound = %llu, stride = %llu\n",
(int)tnum,		(int)tnum, DST->LoopUpperBound,
omptarget_nvptx_threadPrivateContext->LoopUpperBound(tid),		(unsigned long long)DST->NextLowerBound,
(unsigned long long)		(unsigned long long)DST->Stride);
omptarget_nvptx_threadPrivateContext->NextLowerBound(tid),
(unsigned long long)omptarget_nvptx_threadPrivateContext->Stride(
tid));
} else if (schedule == kmp_sched_static_balanced_chunk) {		} else if (schedule == kmp_sched_static_balanced_chunk) {
ASSERT0(LT_FUSSY, chunk > 0, "bad chunk value");		ASSERT0(LT_FUSSY, chunk > 0, "bad chunk value");
// save sched state		// save sched state
omptarget_nvptx_threadPrivateContext->ScheduleType(tid) = schedule;		DST->ScheduleType = schedule;
// save ub		// save ub
omptarget_nvptx_threadPrivateContext->LoopUpperBound(tid) = ub;		DST->LoopUpperBound = ub;
// compute static chunk		// compute static chunk
ST stride;		ST stride;
int lastiter = 0;		int lastiter = 0;
// round up to make sure the chunk is enough to cover all iterations		// round up to make sure the chunk is enough to cover all iterations
T span = (tripCount + tnum - 1) / tnum;		T span = (tripCount + tnum - 1) / tnum;
// perform chunk adjustment		// perform chunk adjustment
chunk = (span + chunk - 1) & ~(chunk - 1);		chunk = (span + chunk - 1) & ~(chunk - 1);

T oldUb = ub;		T oldUb = ub;
ForStaticChunk(lastiter, lb, ub, stride, chunk, threadId, tnum);		ForStaticChunk(lastiter, lb, ub, stride, chunk, threadId, tnum);
ASSERT0(LT_FUSSY, ub >= lb, "ub must be >= lb.");		ASSERT0(LT_FUSSY, ub >= lb, "ub must be >= lb.");
if (ub > oldUb)		if (ub > oldUb)
ub = oldUb;		ub = oldUb;
// save computed params		// save computed params
omptarget_nvptx_threadPrivateContext->Chunk(tid) = chunk;		DST->Chunk = chunk;
omptarget_nvptx_threadPrivateContext->NextLowerBound(tid) = lb;		DST->NextLowerBound = lb;
omptarget_nvptx_threadPrivateContext->Stride(tid) = stride;		DST->Stride = stride;
PRINT(LD_LOOP,		PRINT(LD_LOOP,
"dispatch init (static chunk) : num threads = %d, ub = %" PRId64		"dispatch init (static chunk) : num threads = %d, ub = %" PRId64
", next lower bound = %llu, stride = %llu\n",		", next lower bound = %llu, stride = %llu\n",
(int)tnum,		(int)tnum, DST->LoopUpperBound,
omptarget_nvptx_threadPrivateContext->LoopUpperBound(tid),		(unsigned long long)DST->NextLowerBound,
(unsigned long long)
omptarget_nvptx_threadPrivateContext->NextLowerBound(tid),
(unsigned long long)omptarget_nvptx_threadPrivateContext->Stride(		(unsigned long long)DST->Stride);
tid));
} else if (schedule == kmp_sched_static_nochunk) {		} else if (schedule == kmp_sched_static_nochunk) {
ASSERT0(LT_FUSSY, chunk == 0, "bad chunk value");		ASSERT0(LT_FUSSY, chunk == 0, "bad chunk value");
// save sched state		// save sched state
omptarget_nvptx_threadPrivateContext->ScheduleType(tid) = schedule;		DST->ScheduleType = schedule;
// save ub		// save ub
omptarget_nvptx_threadPrivateContext->LoopUpperBound(tid) = ub;		DST->LoopUpperBound = ub;
// compute static chunk		// compute static chunk
ST stride;		ST stride;
int lastiter = 0;		int lastiter = 0;
ForStaticNoChunk(lastiter, lb, ub, stride, chunk, threadId, tnum);		ForStaticNoChunk(lastiter, lb, ub, stride, chunk, threadId, tnum);
// save computed params		// save computed params
omptarget_nvptx_threadPrivateContext->Chunk(tid) = chunk;		DST->Chunk = chunk;
omptarget_nvptx_threadPrivateContext->NextLowerBound(tid) = lb;		DST->NextLowerBound = lb;
omptarget_nvptx_threadPrivateContext->Stride(tid) = stride;		DST->Stride = stride;
PRINT(LD_LOOP,		PRINT(LD_LOOP,
"dispatch init (static nochunk) : num threads = %d, ub = %" PRId64		"dispatch init (static nochunk) : num threads = %d, ub = %" PRId64
", next lower bound = %llu, stride = %llu\n",		", next lower bound = %llu, stride = %llu\n",
(int)tnum,		(int)tnum, DST->LoopUpperBound,
omptarget_nvptx_threadPrivateContext->LoopUpperBound(tid),		(unsigned long long)DST->NextLowerBound,
(unsigned long long)
omptarget_nvptx_threadPrivateContext->NextLowerBound(tid),
(unsigned long long)omptarget_nvptx_threadPrivateContext->Stride(		(unsigned long long)DST->Stride);
tid));
} else if (schedule == kmp_sched_dynamic || schedule == kmp_sched_guided) {		} else if (schedule == kmp_sched_dynamic || schedule == kmp_sched_guided) {
// save data		// save data
omptarget_nvptx_threadPrivateContext->ScheduleType(tid) = schedule;		DST->ScheduleType = schedule;
if (chunk < 1)		if (chunk < 1)
chunk = 1;		chunk = 1;
omptarget_nvptx_threadPrivateContext->Chunk(tid) = chunk;		DST->Chunk = chunk;
omptarget_nvptx_threadPrivateContext->LoopUpperBound(tid) = ub;		DST->LoopUpperBound = ub;
omptarget_nvptx_threadPrivateContext->NextLowerBound(tid) = lb;		DST->NextLowerBound = lb;
__kmpc_barrier(loc, threadId);		__kmpc_barrier(loc, threadId);
if (tid == 0) {		if (tid == 0) {
omptarget_nvptx_threadPrivateContext->Cnt() = 0;		omptarget_nvptx_threadPrivateContext->Cnt() = 0;
__kmpc_impl_threadfence_block();		__kmpc_impl_threadfence_block();
}		}
__kmpc_barrier(loc, threadId);		__kmpc_barrier(loc, threadId);
PRINT(LD_LOOP,		PRINT(LD_LOOP,
"dispatch init (dyn) : num threads = %d, lb = %llu, ub = %" PRId64		"dispatch init (dyn) : num threads = %d, lb = %llu, ub = %" PRId64
", chunk %" PRIu64 "\n",		", chunk %" PRIu64 "\n",
(int)tnum,		(int)tnum, (unsigned long long)DST->NextLowerBound,
(unsigned long long)		DST->LoopUpperBound, DST->Chunk);
omptarget_nvptx_threadPrivateContext->NextLowerBound(tid),
omptarget_nvptx_threadPrivateContext->LoopUpperBound(tid),
omptarget_nvptx_threadPrivateContext->Chunk(tid));
}		}
}		}

////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
// Support for dispatch next		// Support for dispatch next

INLINE static uint64_t Shuffle(__kmpc_impl_lanemask_t active, int64_t val,		INLINE static uint64_t Shuffle(__kmpc_impl_lanemask_t active, int64_t val,
int leader) {		int leader) {
▲ Show 20 Lines • Show All 48 Lines • ▼ Show 20 Lines	INLINE static int DynamicNextChunk(T &lb, T &ub, T chunkSize,
lb = loopUpperBound + 2;		lb = loopUpperBound + 2;
ub = loopUpperBound + 1;		ub = loopUpperBound + 1;
PRINT(LD_LOOPD, "lb %lld, ub %lld, loop ub %lld; finished\n", (long long)lb,		PRINT(LD_LOOPD, "lb %lld, ub %lld, loop ub %lld; finished\n", (long long)lb,
(long long)ub, (long long)loopUpperBound);		(long long)ub, (long long)loopUpperBound);
return FINISHED;		return FINISHED;
}		}

INLINE static int dispatch_next(kmp_Ident *loc, int32_t gtid, int32_t *plast,		INLINE static int dispatch_next(kmp_Ident *loc, int32_t gtid, int32_t *plast,
T *plower, T *pupper, ST *pstride) {		T *plower, T *pupper, ST *pstride,
		DynamicScheduleTracker *DST) {
if (checkRuntimeUninitialized(loc)) {		if (checkRuntimeUninitialized(loc)) {
// In SPMD mode no need to check parallelism level - dynamic scheduling		// In SPMD mode no need to check parallelism level - dynamic scheduling
// may appear only in L2 parallel regions with lightweight runtime.		// may appear only in L2 parallel regions with lightweight runtime.
ASSERT0(LT_FUSSY, checkSPMDMode(loc), "Expected non-SPMD mode.");		ASSERT0(LT_FUSSY, checkSPMDMode(loc), "Expected non-SPMD mode.");
if (*plast)		if (*plast)
return DISPATCH_FINISHED;		return DISPATCH_FINISHED;
*plast = 1;		*plast = 1;
return DISPATCH_NOTFINISHED;		return DISPATCH_NOTFINISHED;
}		}
// ID of a thread in its own warp		// ID of a thread in its own warp

// automatically selects thread or warp ID based on selected implementation		// automatically selects thread or warp ID based on selected implementation
int tid = GetLogicalThreadIdInBlock(checkSPMDMode(loc));		int tid = GetLogicalThreadIdInBlock(checkSPMDMode(loc));
ASSERT0(LT_FUSSY, gtid < GetNumberOfOmpThreads(checkSPMDMode(loc)),		ASSERT0(LT_FUSSY, gtid < GetNumberOfOmpThreads(checkSPMDMode(loc)),
"current thread is not needed here; error");		"current thread is not needed here; error");
// retrieve schedule		// retrieve schedule
kmp_sched_t schedule =		kmp_sched_t schedule = DST->ScheduleType;
omptarget_nvptx_threadPrivateContext->ScheduleType(tid);

// xxx reduce to one		// xxx reduce to one
if (schedule == kmp_sched_static_chunk ||		if (schedule == kmp_sched_static_chunk ||
schedule == kmp_sched_static_nochunk) {		schedule == kmp_sched_static_nochunk) {
T myLb = omptarget_nvptx_threadPrivateContext->NextLowerBound(tid);		T myLb = DST->NextLowerBound;
T ub = omptarget_nvptx_threadPrivateContext->LoopUpperBound(tid);		T ub = DST->LoopUpperBound;
// finished?		// finished?
if (myLb > ub) {		if (myLb > ub) {
PRINT(LD_LOOP, "static loop finished with myLb %lld, ub %lld\n",		PRINT(LD_LOOP, "static loop finished with myLb %lld, ub %lld\n",
(long long)myLb, (long long)ub);		(long long)myLb, (long long)ub);
return DISPATCH_FINISHED;		return DISPATCH_FINISHED;
}		}
// not finished, save current bounds		// not finished, save current bounds
ST chunk = omptarget_nvptx_threadPrivateContext->Chunk(tid);		ST chunk = DST->Chunk;
*plower = myLb;		*plower = myLb;
T myUb = myLb + chunk - 1; // Clang uses i <= ub		T myUb = myLb + chunk - 1; // Clang uses i <= ub
if (myUb > ub)		if (myUb > ub)
myUb = ub;		myUb = ub;
*pupper = myUb;		*pupper = myUb;
*plast = (int32_t)(myUb == ub);		*plast = (int32_t)(myUb == ub);

// increment next lower bound by the stride		// increment next lower bound by the stride
ST stride = omptarget_nvptx_threadPrivateContext->Stride(tid);		ST stride = DST->Stride;
omptarget_nvptx_threadPrivateContext->NextLowerBound(tid) = myLb + stride;		DST->NextLowerBound = myLb + stride;
PRINT(LD_LOOP, "static loop continues with myLb %lld, myUb %lld\n",		PRINT(LD_LOOP, "static loop continues with myLb %lld, myUb %lld\n",
(long long)*plower, (long long)*pupper);		(long long)*plower, (long long)*pupper);
return DISPATCH_NOTFINISHED;		return DISPATCH_NOTFINISHED;
}		}
ASSERT0(LT_FUSSY,		ASSERT0(LT_FUSSY,
schedule == kmp_sched_dynamic || schedule == kmp_sched_guided,		schedule == kmp_sched_dynamic || schedule == kmp_sched_guided,
"bad sched");		"bad sched");
T myLb, myUb;		T myLb, myUb;
int finished = DynamicNextChunk(		int finished = DynamicNextChunk(myLb, myUb, DST->Chunk, DST->NextLowerBound,
myLb, myUb, omptarget_nvptx_threadPrivateContext->Chunk(tid),		DST->LoopUpperBound);
omptarget_nvptx_threadPrivateContext->NextLowerBound(tid),
omptarget_nvptx_threadPrivateContext->LoopUpperBound(tid));

if (finished == FINISHED)		if (finished == FINISHED)
return DISPATCH_FINISHED;		return DISPATCH_FINISHED;

// not finished (either not finished or last chunk)		// not finished (either not finished or last chunk)
*plast = (int32_t)(finished == LAST_CHUNK);		*plast = (int32_t)(finished == LAST_CHUNK);
*plower = myLb;		*plower = myLb;
*pupper = myUb;		*pupper = myUb;
Show All 16 Lines	public:
// end of template class that encapsulate all the helper functions		// end of template class that encapsulate all the helper functions
////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
};		};
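
The static-chunk branch of `dispatch_next` shown above hands each thread one inclusive chunk per call and then advances `NextLowerBound` by `Stride`. A minimal host-side sketch of that arithmetic follows. The `Tracker` struct, `dispatchInit`, `dispatchNext`, and `collectChunks` are hypothetical names, and the initial values chosen in `dispatchInit` (start at `Lb + Tid * Chunk`, stride of `Chunk * NumThreads`) are an assumption, since the corresponding init code is not part of this excerpt.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for the per-loop state; field names follow
// the diff, field types are assumed.
struct Tracker {
  int64_t Chunk;
  int64_t LoopUpperBound;
  int64_t NextLowerBound;
  int64_t Stride;
};

// Assumed initialization for a static chunked schedule: thread Tid starts at
// its own first chunk and jumps over the chunks of all other threads.
static Tracker dispatchInit(int64_t Lb, int64_t Ub, int64_t Chunk, int Tid,
                            int NumThreads) {
  Tracker T;
  T.Chunk = Chunk;
  T.LoopUpperBound = Ub;
  T.NextLowerBound = Lb + Tid * Chunk;
  T.Stride = Chunk * NumThreads;
  return T;
}

// Mirrors the branch in the diff: returns false once no chunk is left,
// otherwise writes the inclusive bounds and the last-chunk flag.
static bool dispatchNext(Tracker &T, int64_t &Lower, int64_t &Upper,
                         bool &Last) {
  int64_t MyLb = T.NextLowerBound;
  int64_t Ub = T.LoopUpperBound;
  if (MyLb > Ub)
    return false;
  int64_t MyUb = MyLb + T.Chunk - 1; // inclusive bound, i <= ub
  if (MyUb > Ub)
    MyUb = Ub;
  Lower = MyLb;
  Upper = MyUb;
  Last = (MyUb == Ub);
  T.NextLowerBound = MyLb + T.Stride; // advance by the stride
  return true;
}

// Collects every chunk one thread would receive.
static std::vector<std::pair<int64_t, int64_t>>
collectChunks(int64_t Lb, int64_t Ub, int64_t Chunk, int Tid, int NumThreads) {
  std::vector<std::pair<int64_t, int64_t>> Chunks;
  Tracker T = dispatchInit(Lb, Ub, Chunk, Tid, NumThreads);
  int64_t Lower = 0, Upper = 0;
  bool Last = false;
  while (dispatchNext(T, Lower, Upper, Last))
    Chunks.emplace_back(Lower, Upper);
  return Chunks;
}
```

Under these assumptions, iterations 0 through 9 with a chunk size of 4 and two threads give thread 0 the chunks [0, 3] and [8, 9], and thread 1 the chunk [4, 7].
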

////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
// KMP interface implementation (dyn loops)		// KMP interface implementation (dyn loops)
////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////

		// TODO: This is a stopgap. We probably want to expand the dispatch API to take
		// a DST pointer which can then be allocated properly without malloc.
		DynamicScheduleTracker *THREAD_LOCAL(ThreadDSTPtr);

		// Create a new DST, link the current one, and define the new as current.
		static DynamicScheduleTracker *pushDST() {
		DynamicScheduleTracker *NewDST = static_cast<DynamicScheduleTracker *>(
		SafeMalloc(sizeof(DynamicScheduleTracker), "new DST"));
		*NewDST = DynamicScheduleTracker({0});
		NewDST->NextDST = ThreadDSTPtr;
		ThreadDSTPtr = NewDST;
		return ThreadDSTPtr;
		}

		// Return the current DST.
		static DynamicScheduleTracker *peekDST() { return ThreadDSTPtr; }

		// Pop the current DST and restore the last one.
		static void popDST() {
		DynamicScheduleTracker *OldDST = ThreadDSTPtr->NextDST;
		SafeFree(ThreadDSTPtr, "remove DST");
		ThreadDSTPtr = OldDST;
		}
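
The three helpers above form a per-thread, singly linked stack: `pushDST` allocates a tracker and links it in front of the current one, `peekDST` returns the head, and `popDST` frees the head and restores its predecessor. The following is a minimal host-side sketch of that pattern, not the device code itself. It assumes `std::malloc`/`std::free` in place of `SafeMalloc`/`SafeFree`, `thread_local` in place of `THREAD_LOCAL`, and guessed field types. `nestedDemo` is a hypothetical helper that exercises two nested loops.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Host-side stand-in for the device structure; field types are assumed.
struct DynamicScheduleTracker {
  int64_t Chunk;
  int64_t LoopUpperBound;
  uint64_t NextLowerBound;
  int64_t Stride;
  DynamicScheduleTracker *NextDST; // link to the enclosing loop's tracker
};

// One stack head per thread.
static thread_local DynamicScheduleTracker *ThreadDSTPtr = nullptr;

// Create a new tracker, link the current one behind it, make it current.
static DynamicScheduleTracker *pushDST() {
  auto *NewDST = static_cast<DynamicScheduleTracker *>(
      std::malloc(sizeof(DynamicScheduleTracker)));
  assert(NewDST && "allocation failed");
  *NewDST = DynamicScheduleTracker{}; // zero-initialize all fields
  NewDST->NextDST = ThreadDSTPtr;
  ThreadDSTPtr = NewDST;
  return ThreadDSTPtr;
}

// Return the current tracker without modifying the stack.
static DynamicScheduleTracker *peekDST() { return ThreadDSTPtr; }

// Free the current tracker and restore the previous one.
static void popDST() {
  DynamicScheduleTracker *OldDST = ThreadDSTPtr->NextDST;
  std::free(ThreadDSTPtr);
  ThreadDSTPtr = OldDST;
}

// Simulates an inner dynamic loop nested in an outer one and returns the
// number of checks that held (4 when the stack behaves as expected).
static int nestedDemo() {
  int Passed = 0;
  DynamicScheduleTracker *Outer = pushDST(); // outer loop: dispatch_init
  Outer->Chunk = 4;
  DynamicScheduleTracker *Inner = pushDST(); // inner loop: dispatch_init
  Inner->Chunk = 8;
  Passed += (peekDST() == Inner);            // inner loop: dispatch_next
  Passed += (Inner->NextDST == Outer);
  popDST();                                  // inner loop: dispatch_fini
  Passed += (peekDST() == Outer && Outer->Chunk == 4);
  popDST();                                  // outer loop: dispatch_fini
  Passed += (peekDST() == nullptr);
  return Passed;
}
```

The sketch also makes the cost named in the summary visible: every dynamically scheduled loop pays for one allocation and one free, while code that never enters such a loop allocates nothing.
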

// init		// init
EXTERN void __kmpc_dispatch_init_4(kmp_Ident *loc, int32_t tid,		EXTERN void __kmpc_dispatch_init_4(kmp_Ident *loc, int32_t tid,
int32_t schedule, int32_t lb, int32_t ub,		int32_t schedule, int32_t lb, int32_t ub,
int32_t st, int32_t chunk) {		int32_t st, int32_t chunk) {
PRINT0(LD_IO, "call kmpc_dispatch_init_4\n");		PRINT0(LD_IO, "call kmpc_dispatch_init_4\n");
		DynamicScheduleTracker *DST = pushDST();
omptarget_nvptx_LoopSupport<int32_t, int32_t>::dispatch_init(		omptarget_nvptx_LoopSupport<int32_t, int32_t>::dispatch_init(
loc, tid, (kmp_sched_t)schedule, lb, ub, st, chunk);		loc, tid, (kmp_sched_t)schedule, lb, ub, st, chunk, DST);
}		}

EXTERN void __kmpc_dispatch_init_4u(kmp_Ident *loc, int32_t tid,		EXTERN void __kmpc_dispatch_init_4u(kmp_Ident *loc, int32_t tid,
int32_t schedule, uint32_t lb, uint32_t ub,		int32_t schedule, uint32_t lb, uint32_t ub,
int32_t st, int32_t chunk) {		int32_t st, int32_t chunk) {
PRINT0(LD_IO, "call kmpc_dispatch_init_4u\n");		PRINT0(LD_IO, "call kmpc_dispatch_init_4u\n");
		DynamicScheduleTracker *DST = pushDST();
omptarget_nvptx_LoopSupport<uint32_t, int32_t>::dispatch_init(		omptarget_nvptx_LoopSupport<uint32_t, int32_t>::dispatch_init(
loc, tid, (kmp_sched_t)schedule, lb, ub, st, chunk);		loc, tid, (kmp_sched_t)schedule, lb, ub, st, chunk, DST);
}		}

EXTERN void __kmpc_dispatch_init_8(kmp_Ident *loc, int32_t tid,		EXTERN void __kmpc_dispatch_init_8(kmp_Ident *loc, int32_t tid,
int32_t schedule, int64_t lb, int64_t ub,		int32_t schedule, int64_t lb, int64_t ub,
int64_t st, int64_t chunk) {		int64_t st, int64_t chunk) {
PRINT0(LD_IO, "call kmpc_dispatch_init_8\n");		PRINT0(LD_IO, "call kmpc_dispatch_init_8\n");
		DynamicScheduleTracker *DST = pushDST();
omptarget_nvptx_LoopSupport<int64_t, int64_t>::dispatch_init(		omptarget_nvptx_LoopSupport<int64_t, int64_t>::dispatch_init(
loc, tid, (kmp_sched_t)schedule, lb, ub, st, chunk);		loc, tid, (kmp_sched_t)schedule, lb, ub, st, chunk, DST);
}		}

EXTERN void __kmpc_dispatch_init_8u(kmp_Ident *loc, int32_t tid,		EXTERN void __kmpc_dispatch_init_8u(kmp_Ident *loc, int32_t tid,
int32_t schedule, uint64_t lb, uint64_t ub,		int32_t schedule, uint64_t lb, uint64_t ub,
int64_t st, int64_t chunk) {		int64_t st, int64_t chunk) {
PRINT0(LD_IO, "call kmpc_dispatch_init_8u\n");		PRINT0(LD_IO, "call kmpc_dispatch_init_8u\n");
		DynamicScheduleTracker *DST = pushDST();
omptarget_nvptx_LoopSupport<uint64_t, int64_t>::dispatch_init(		omptarget_nvptx_LoopSupport<uint64_t, int64_t>::dispatch_init(
loc, tid, (kmp_sched_t)schedule, lb, ub, st, chunk);		loc, tid, (kmp_sched_t)schedule, lb, ub, st, chunk, DST);
}		}

// next		// next
EXTERN int __kmpc_dispatch_next_4(kmp_Ident *loc, int32_t tid, int32_t *p_last,		EXTERN int __kmpc_dispatch_next_4(kmp_Ident *loc, int32_t tid, int32_t *p_last,
int32_t *p_lb, int32_t *p_ub, int32_t *p_st) {		int32_t *p_lb, int32_t *p_ub, int32_t *p_st) {
PRINT0(LD_IO, "call kmpc_dispatch_next_4\n");		PRINT0(LD_IO, "call kmpc_dispatch_next_4\n");
		DynamicScheduleTracker *DST = peekDST();
return omptarget_nvptx_LoopSupport<int32_t, int32_t>::dispatch_next(		return omptarget_nvptx_LoopSupport<int32_t, int32_t>::dispatch_next(
loc, tid, p_last, p_lb, p_ub, p_st);		loc, tid, p_last, p_lb, p_ub, p_st, DST);
}		}

EXTERN int __kmpc_dispatch_next_4u(kmp_Ident *loc, int32_t tid, int32_t *p_last,		EXTERN int __kmpc_dispatch_next_4u(kmp_Ident *loc, int32_t tid, int32_t *p_last,
uint32_t *p_lb, uint32_t *p_ub,		uint32_t *p_lb, uint32_t *p_ub,
int32_t *p_st) {		int32_t *p_st) {
PRINT0(LD_IO, "call kmpc_dispatch_next_4u\n");		PRINT0(LD_IO, "call kmpc_dispatch_next_4u\n");
		DynamicScheduleTracker *DST = peekDST();
return omptarget_nvptx_LoopSupport<uint32_t, int32_t>::dispatch_next(		return omptarget_nvptx_LoopSupport<uint32_t, int32_t>::dispatch_next(
loc, tid, p_last, p_lb, p_ub, p_st);		loc, tid, p_last, p_lb, p_ub, p_st, DST);
}		}

EXTERN int __kmpc_dispatch_next_8(kmp_Ident *loc, int32_t tid, int32_t *p_last,		EXTERN int __kmpc_dispatch_next_8(kmp_Ident *loc, int32_t tid, int32_t *p_last,
int64_t *p_lb, int64_t *p_ub, int64_t *p_st) {		int64_t *p_lb, int64_t *p_ub, int64_t *p_st) {
PRINT0(LD_IO, "call kmpc_dispatch_next_8\n");		PRINT0(LD_IO, "call kmpc_dispatch_next_8\n");
		DynamicScheduleTracker *DST = peekDST();
return omptarget_nvptx_LoopSupport<int64_t, int64_t>::dispatch_next(		return omptarget_nvptx_LoopSupport<int64_t, int64_t>::dispatch_next(
loc, tid, p_last, p_lb, p_ub, p_st);		loc, tid, p_last, p_lb, p_ub, p_st, DST);
}		}

EXTERN int __kmpc_dispatch_next_8u(kmp_Ident *loc, int32_t tid, int32_t *p_last,		EXTERN int __kmpc_dispatch_next_8u(kmp_Ident *loc, int32_t tid, int32_t *p_last,
uint64_t *p_lb, uint64_t *p_ub,		uint64_t *p_lb, uint64_t *p_ub,
int64_t *p_st) {		int64_t *p_st) {
PRINT0(LD_IO, "call kmpc_dispatch_next_8u\n");		PRINT0(LD_IO, "call kmpc_dispatch_next_8u\n");
		DynamicScheduleTracker *DST = peekDST();
return omptarget_nvptx_LoopSupport<uint64_t, int64_t>::dispatch_next(		return omptarget_nvptx_LoopSupport<uint64_t, int64_t>::dispatch_next(
loc, tid, p_last, p_lb, p_ub, p_st);		loc, tid, p_last, p_lb, p_ub, p_st, DST);
}		}

// fini		// fini
EXTERN void __kmpc_dispatch_fini_4(kmp_Ident *loc, int32_t tid) {		EXTERN void __kmpc_dispatch_fini_4(kmp_Ident *loc, int32_t tid) {
PRINT0(LD_IO, "call kmpc_dispatch_fini_4\n");		PRINT0(LD_IO, "call kmpc_dispatch_fini_4\n");
omptarget_nvptx_LoopSupport<int32_t, int32_t>::dispatch_fini();		omptarget_nvptx_LoopSupport<int32_t, int32_t>::dispatch_fini();
		popDST();
}		}

EXTERN void __kmpc_dispatch_fini_4u(kmp_Ident *loc, int32_t tid) {		EXTERN void __kmpc_dispatch_fini_4u(kmp_Ident *loc, int32_t tid) {
PRINT0(LD_IO, "call kmpc_dispatch_fini_4u\n");		PRINT0(LD_IO, "call kmpc_dispatch_fini_4u\n");
omptarget_nvptx_LoopSupport<uint32_t, int32_t>::dispatch_fini();		omptarget_nvptx_LoopSupport<uint32_t, int32_t>::dispatch_fini();
		popDST();
}		}

EXTERN void __kmpc_dispatch_fini_8(kmp_Ident *loc, int32_t tid) {		EXTERN void __kmpc_dispatch_fini_8(kmp_Ident *loc, int32_t tid) {
PRINT0(LD_IO, "call kmpc_dispatch_fini_8\n");		PRINT0(LD_IO, "call kmpc_dispatch_fini_8\n");
omptarget_nvptx_LoopSupport<int64_t, int64_t>::dispatch_fini();		omptarget_nvptx_LoopSupport<int64_t, int64_t>::dispatch_fini();
		popDST();
}		}

EXTERN void __kmpc_dispatch_fini_8u(kmp_Ident *loc, int32_t tid) {		EXTERN void __kmpc_dispatch_fini_8u(kmp_Ident *loc, int32_t tid) {
PRINT0(LD_IO, "call kmpc_dispatch_fini_8u\n");		PRINT0(LD_IO, "call kmpc_dispatch_fini_8u\n");
omptarget_nvptx_LoopSupport<uint64_t, int64_t>::dispatch_fini();		omptarget_nvptx_LoopSupport<uint64_t, int64_t>::dispatch_fini();
		popDST();
}		}

////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////
// KMP interface implementation (static loops)		// KMP interface implementation (static loops)
////////////////////////////////////////////////////////////////////////////////		////////////////////////////////////////////////////////////////////////////////

EXTERN void __kmpc_for_static_init_4(kmp_Ident *loc, int32_t global_tid,		EXTERN void __kmpc_for_static_init_4(kmp_Ident *loc, int32_t global_tid,
int32_t schedtype, int32_t *plastiter,		int32_t schedtype, int32_t *plastiter,

openmp/libomptarget/deviceRTLs/common/src/parallel.cu

Show First 20 Lines • Show All 210 Lines • ▼ Show 20 Lines	EXTERN void __kmpc_serialized_parallel(kmp_Ident *loc, uint32_t global_tid) {
// assume this is only called for nested parallel		// assume this is only called for nested parallel
int threadId = GetLogicalThreadIdInBlock(checkSPMDMode(loc));		int threadId = GetLogicalThreadIdInBlock(checkSPMDMode(loc));

// unlike actual parallel, threads in the same team do not share		// unlike actual parallel, threads in the same team do not share
// the workTaskDescr in this case and num threads is fixed to 1		// the workTaskDescr in this case and num threads is fixed to 1

// get current task		// get current task
omptarget_nvptx_TaskDescr *currTaskDescr = getMyTopTaskDescriptor(threadId);		omptarget_nvptx_TaskDescr *currTaskDescr = getMyTopTaskDescriptor(threadId);
currTaskDescr->SaveLoopData();

// allocate new task descriptor and copy value from current one, set prev to		// allocate new task descriptor and copy value from current one, set prev to
// it		// it
omptarget_nvptx_TaskDescr *newTaskDescr =		omptarget_nvptx_TaskDescr *newTaskDescr =
(omptarget_nvptx_TaskDescr *)SafeMalloc(sizeof(omptarget_nvptx_TaskDescr),		(omptarget_nvptx_TaskDescr *)SafeMalloc(sizeof(omptarget_nvptx_TaskDescr),
"new seq parallel task");		"new seq parallel task");
newTaskDescr->CopyParent(currTaskDescr);		newTaskDescr->CopyParent(currTaskDescr);

Show All 23 Lines	EXTERN void __kmpc_end_serialized_parallel(kmp_Ident *loc,
int threadId = GetLogicalThreadIdInBlock(checkSPMDMode(loc));		int threadId = GetLogicalThreadIdInBlock(checkSPMDMode(loc));
omptarget_nvptx_TaskDescr *currTaskDescr = getMyTopTaskDescriptor(threadId);		omptarget_nvptx_TaskDescr *currTaskDescr = getMyTopTaskDescriptor(threadId);
// set new top		// set new top
omptarget_nvptx_threadPrivateContext->SetTopLevelTaskDescr(		omptarget_nvptx_threadPrivateContext->SetTopLevelTaskDescr(
threadId, currTaskDescr->GetPrevTaskDescr());		threadId, currTaskDescr->GetPrevTaskDescr());
// free		// free
SafeFree(currTaskDescr, "new seq parallel task");		SafeFree(currTaskDescr, "new seq parallel task");
currTaskDescr = getMyTopTaskDescriptor(threadId);		currTaskDescr = getMyTopTaskDescriptor(threadId);
currTaskDescr->RestoreLoopData();
}		}

EXTERN uint16_t __kmpc_parallel_level(kmp_Ident *loc, uint32_t global_tid) {		EXTERN uint16_t __kmpc_parallel_level(kmp_Ident *loc, uint32_t global_tid) {
PRINT0(LD_IO, "call to __kmpc_parallel_level\n");		PRINT0(LD_IO, "call to __kmpc_parallel_level\n");

return parallelLevel[GetWarpId()] & (OMP_ACTIVE_PARALLEL_LEVEL - 1);		return parallelLevel[GetWarpId()] & (OMP_ACTIVE_PARALLEL_LEVEL - 1);
}		}
