This is an archive of the discontinued LLVM Phabricator instance.

Differential D32431

[Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen
ClosedPublic

Authored by PhilippSchaad on Apr 24 2017, 6:39 AM.

Details

Reviewers

grosser
bollu
Meinersbur
etherzhhb
singam-sanjay

Commits

rG17f01968f118: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen
rG51904ae35aad: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen
rPLO302379: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen
rPLO302215: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen
rL302379: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen
rL302215: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen

Summary

When compiling for GPU, one can now choose to compile for OpenCL or CUDA,
with the corresponding polly-gpu-runtime flag (libopencl / libcudart). The
GPURuntime library (GPUJIT) has been extended with the OpenCL Runtime library
for that purpose, correctly choosing the corresponding library calls to the
option chosen when compiling (via different initialization calls).

Additionally, a specific GPU Target architecture can now be chosen with -polly-gpu-arch (only nvptx64 implemented thus far).
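
As a rough illustration of the selection described in the summary, the sketch below maps a runtime option value to an enum and then to the name of the GPUJIT initialization call. The enum is modelled on the one this patch adds, and the two function names are the ones that appear in the diff. `parseGPURuntime` and `initContextName` are hypothetical helpers that are not part of the patch, and the accepted spellings are assumed from the summary. It needs C++17 for `std::optional`.

```cpp
#include <optional>
#include <string>

// Modelled on the GPURuntime enum added in PPCGCodeGeneration.h.
enum class GPURuntime { CUDA, OpenCL };

// Hypothetical helper: parse the value passed to -polly-gpu-runtime.
// The spellings (libcudart / libopencl) are assumed from the summary.
std::optional<GPURuntime> parseGPURuntime(const std::string &Value) {
  if (Value == "libcudart")
    return GPURuntime::CUDA;
  if (Value == "libopencl")
    return GPURuntime::OpenCL;
  return std::nullopt; // Unknown runtime name.
}

// Hypothetical helper: name of the GPUJIT initialization call to emit.
// The two names are taken from the diff of createCallInitContext.
const char *initContextName(GPURuntime Runtime) {
  switch (Runtime) {
  case GPURuntime::CUDA:
    return "polly_initContextCUDA";
  case GPURuntime::OpenCL:
    return "polly_initContextCL";
  }
  return nullptr; // Not reached for valid enumerators.
}
```

Calling `initContextName(*parseGPURuntime("libopencl"))` yields the OpenCL initialization symbol, so the choice made at compile time decides which half of the runtime library the generated code uses.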

Diff Detail

Repository: rL LLVM

Event Timeline

PhilippSchaad created this revision. Apr 24 2017, 6:39 AM

Herald added subscribers: Anastasia, yaxunl, mgorny, nemanjai. Apr 24 2017, 6:39 AM

PhilippSchaad retitled this revision from Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen to [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen. Apr 24 2017, 6:44 AM

PhilippSchaad set the repository for this revision to rL LLVM.

PhilippSchaad added a project: Restricted Project.

PhilippSchaad added subscribers: pollydev, llvm-commits.

Replaced magic numbers, added assertions and fixed if-braces.

PhilippSchaad added reviewers: Meinersbur, etherzhhb. Apr 25 2017, 12:39 AM

I wrote a runtime with similar scope here: https://github.com/Meinersbur/prl . We were once discussing using it for Polly as well. What's the status of that?

lib/CodeGen/PPCGCodeGeneration.cpp
56–58 ↗	(On Diff #96395)	Did you consider an enum?
1562 ↗	(On Diff #96395)	Is there some vendor-neutral triple?
1656 ↗	(On Diff #96395)	Why a static flag?
2530–2538 ↗	(On Diff #96395)	Did you consider Pass *polly::createPPCGCodeGenerationPass(int Runtime); ?
lib/Support/RegisterPasses.cpp
312–319 ↗	(On Diff #96395)	A switch instead?

In D32431#736600, @Meinersbur wrote:

I wrote a runtime with similar scope here: https://github.com/Meinersbur/prl . We were once discussing using it for Polly as well. What's the status of that?

I have looked into it a tiny little bit about a month ago, but had then decided to write a basic OpenCL Runtime from scratch in GPUJIT. So to my knowledge, nothing has changed on that status yet.

Currently looking into the remaining points from your comments.

lib/CodeGen/PPCGCodeGeneration.cpp
1562 ↗	(On Diff #96395)	Do you mean like `nvptx64-nvcl` / `nvptx64-cuda`?

PhilippSchaad added inline comments. Apr 25 2017, 9:38 AM

lib/CodeGen/PPCGCodeGeneration.cpp
2530–2538 ↗	(On Diff #96395)	That seems reasonable, but I get a template-conflict for the LLVM Pass-Creation template when trying to change the pass-creation-method structure. I thought it might be easier this way?

PhilippSchaad added inline comments. Apr 25 2017, 9:45 AM

lib/CodeGen/PPCGCodeGeneration.cpp
2530–2538 ↗	(On Diff #96395)	Correction: looking at wrong function of course, you mean a different one :-)

Stylistic changes and switch to -polly-gpu-runtime=cuda/opencl compiler flag

PhilippSchaad marked 6 inline comments as done. Apr 25 2017, 12:20 PM

Removed left over commented out macros

Meinersbur added inline comments. Apr 25 2017, 1:36 PM

lib/CodeGen/PPCGCodeGeneration.cpp
1562 ↗	(On Diff #96395)	I hoped that there might be some kind of triple that works for OpenCL in general, not only for nvidia (`nvptx`, `nvcl`). If the generated program only works for devices that support cuda anyway, I don't see where the benefit of such a backend is. If there is indeed no backend that also works on non-nvidia devices, should we name the runtime accordingly, e.g. "nvcl"?
58–60 ↗	(On Diff #96620)	See http://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly for LLVM's coding policy for enum members. Nitpick: A "T" suffix is rather unusual.
156 ↗	(On Diff #96620)	Nitpick: No need to use an `enum` qualifier.
lib/Support/RegisterPasses.cpp
312–319 ↗	(On Diff #96395)	Now that `createPPCGCodeGenerationPass` takes an argument, you don't need a switch anymore.
tools/GPURuntime/GPUJIT.c
303–311 ↗	(On Diff #96620)	Consistent variable name style? What style do you intend to use in this file?
347–348 ↗	(On Diff #96620)	Replace the magic number 256 by `sizeof(DeviceRevision)`?

etherzhhb added inline comments. Apr 25 2017, 5:12 PM

include/polly/LinkAllPasses.h
51 ↗	(On Diff #96620)	Is this Runtime supposed to be of type GPURuntimeT? It is a little bit tricky here. Maybe we need to introduce a PPCG header, define the runtime enum there, and then include that header here. Or we can declare the function as llvm::Pass *createPPCGCodeGenerationPass(int Runtime = 0); to at least avoid the magic number 0 in line 86.
lib/CodeGen/PPCGCodeGeneration.cpp
1562 ↗	(On Diff #96395)	for opencl, it can be "spir-unknown-unknown" or "spir64-unknown-unknown", but that may not work :)

Looking into the rest of your comments.

include/polly/LinkAllPasses.h
51 ↗	(On Diff #96620)	Yes, it would be. The reason it's not is exactly the one you mentioned. I was considering adding a PPCG header, but refrained from it because I was hesitant about creating a header 'just for one enum'. If you agree that this is a good solution, I will indeed introduce a new header for PPCG and define the enum there, to get rid of magic numbers. The second option seems reasonable too though.
lib/CodeGen/PPCGCodeGeneration.cpp
1562 ↗	(On Diff #96395)	Looking into it. The next goal would be to add the AMDGPU backend to generate AMD ISA, which would then again utilize the same OpenCL Runtime implemented here. (I realize there will have to be some naming changes to make that clear in the `GPUJIT`, but as you pointed out, I have a naming mess to fix there anyway.)

etherzhhb added inline comments. Apr 26 2017, 12:12 AM

include/polly/LinkAllPasses.h
51 ↗	(On Diff #96620)	we could start from the second option if you think it is reasonable

Addressed consistency and naming concerns

PhilippSchaad marked 7 inline comments as done. Apr 26 2017, 3:15 AM

PhilippSchaad edited the summary of this revision. Apr 26 2017, 3:19 AM

Made CUDA Runtime default, fixed formatting, adapted test case

Hi Philip and others,

this already looks very cool. I also added some minor comments.

Best,
Tobias

lib/CodeGen/PPCGCodeGeneration.cpp
56–58 ↗	(On Diff #96395)	You can use C++11 enums ala enum class GPURuntime { CUDA, OpenCL };
1562 ↗	(On Diff #96395)	Making OpenCL work for CUDA is just the first step. I expect that when adding AMDGPU support, we will use here different triples depending on which vendor to target. AMD will have a specific one, CUDA will have a specific one, and for Intel we likely use the generic SPIR-V comment. I assume this could then also work for Xilinx.
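
To illustrate the `enum class` suggestion above: a scoped enumeration keeps its enumerators out of the enclosing scope and does not convert to `int` implicitly, which is relevant because the pass-creation function was taking `int` arguments at this point in the review. The helper `toInt` is hypothetical and only shows the explicit cast that such an interface would need.

```cpp
#include <type_traits>

// Scoped enumeration as proposed in the review comment.
enum class GPURuntime { CUDA, OpenCL };

// A scoped enum is not implicitly convertible to int.
static_assert(!std::is_convertible<GPURuntime, int>::value,
              "scoped enums require an explicit cast");

// Hypothetical helper: explicit conversion for an int-based interface.
constexpr int toInt(GPURuntime Runtime) { return static_cast<int>(Runtime); }
```

Enumerators without initializers start at 0 and count upwards, so `toInt(GPURuntime::CUDA)` is 0 and `toInt(GPURuntime::OpenCL)` is 1.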

Fixed enum style to C++11

PhilippSchaad marked an inline comment as done. Apr 27 2017, 5:03 AM

Meinersbur added inline comments. Apr 27 2017, 6:55 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1562 ↗	(On Diff #96395)	At compile time, we don't know which hardware it will run on, so we cannot specify a triple here. Unless you think of a runtime dispatch system, in which case you need to generate all kernels at once. Even then, I still would like to be able to select a single target when I know I will run only on that hardware, to keep the executable small.

PhilippSchaad added inline comments. Apr 27 2017, 7:49 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1562 ↗	(On Diff #96395)	I thought the goal was to let the user compile for a specific target, i.e. providing something like -polly-gpu-arch=amd/nvidia/intel, and then choosing the correct target triple according to said selection. Meaning for example -polly-gpu-arch=amd would utilize the AMDGPU backend triple and feed that into the OpenCL runtime. Am I misunderstanding something?

Meinersbur added inline comments. Apr 27 2017, 8:48 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1562 ↗	(On Diff #96395)	I think we were miscommunicating. The -polly-gpu-arch switch is new to me and doesn't appear in this patch. I assumed a fat executable when you mentioned an AMD backend. OpenCL claims to be hardware-independent, with the platform's driver translating OpenCL-C or SPIR(-V) to its proprietary format. In NVidia's terminology, CUDA is a platform that includes CUDA C++, their OpenCL implementation, cudart (CUDA runtime) etc. We are still vendor-locked to CUDA, since it only works with CUDA's OpenCL runtime library. -polly-gpu-runtime=opencl therefore is misleading (at least it was to me); it is no alternative to CUDA. It might resolve if it is indeed just the runtime library GPUJIT is linked to. If so, could you make it more clear? I suggest the following switches:
-polly-target=cpu/gpu
if -polly-target=gpu then -polly-gpu-arch=nvptx64/hsa/spir/spir-v/opencl-c (with -polly-gpu-arch=nvptx64 the only one implemented so far)
if -polly-gpu-arch=nvptx64 then there is a choice between -polly-cuda-runtime=libcudart/libopencl

GPURuntime works on systems with just one of CUDA/OpenCL now.
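
A minimal sketch of how a runtime library could stay buildable with only one of the two SDKs installed, assuming the `HAS_LIBCUDART` and `HAS_LIBOPENCL` definitions that the CMake change in this revision adds. `isRuntimeCompiledIn` is a hypothetical helper for illustration and is not claimed to exist in GPUJIT.

```cpp
enum class GPURuntime { CUDA, OpenCL };

// Hypothetical helper: reports whether support for a runtime was compiled
// in. HAS_LIBCUDART / HAS_LIBOPENCL are assumed to be set by the build
// system when the corresponding library was found.
bool isRuntimeCompiledIn(GPURuntime Runtime) {
  switch (Runtime) {
  case GPURuntime::CUDA:
#ifdef HAS_LIBCUDART
    return true;
#else
    return false;
#endif
  case GPURuntime::OpenCL:
#ifdef HAS_LIBOPENCL
    return true;
#else
    return false;
#endif
  }
  return false; // Not reached for valid enumerators.
}
```

Code that includes vendor headers or calls vendor functions would sit behind the same guards, so a missing SDK removes that half of the library instead of breaking the build.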

Harbormaster completed remote builds in B6005: Diff 97189. Apr 29 2017, 6:38 AM

PhilippSchaad added inline comments. Apr 29 2017, 6:40 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1562 ↗	(On Diff #96395)	This change should address exactly this. The framework is now set to introduce new architectures and utilize eg. the AMDGPU backend instead of NVPTX etc.

singam-sanjay added a subscriber: singam-sanjay. Apr 29 2017, 7:37 AM

singam-sanjay added inline comments.

lib/CodeGen/PPCGCodeGeneration.cpp
1623 ↗	(On Diff #97189)	Does `nvptx64-nvidia-nvcl` mean OpenCL code meant to be run on NVIDIA GPUs ?

PhilippSchaad added inline comments. Apr 29 2017, 7:40 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1623 ↗	(On Diff #97189)	Yes, exactly. It generates a slightly different flavor of PTX, which can be used by OpenCL to generate a kernel from the PTX binary (on NVIDIA GPUs). If you were to use the standard CUDA PTX, OpenCL would complain because of wrong argument accesses.

PhilippSchaad edited the summary of this revision. Apr 29 2017, 8:14 AM

PhilippSchaad set the repository for this revision to rL LLVM.

singam-sanjay added inline comments. Apr 29 2017, 9:33 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1623 ↗	(On Diff #97189)	Okay. From what you're saying, `nvptx64-nvidia-nvcl` indicates that the backend must generate NVPTX code for a 64-bit architecture for an NVIDIA GPU controlled by an OpenCL driver. Please correct me if I'm wrong.

PhilippSchaad added inline comments. Apr 29 2017, 9:37 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1623 ↗	(On Diff #97189)	That is correct.
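
The triple discussion in this thread could be summarized by a small lookup like the one below. It is a sketch only: `targetTriple` is a hypothetical helper, `nvptx64-nvidia-nvcl` is the triple quoted in the comments, and `nvptx64-nvidia-cuda` is assumed here as the conventional CUDA counterpart. The review text does not show which string the patch itself uses for CUDA.

```cpp
#include <string>

enum class GPUArch { NVPTX64 };
enum class GPURuntime { CUDA, OpenCL };

// Hypothetical helper: choose the target triple for the kernel module.
// The OS component of the triple selects the PTX flavor that the chosen
// runtime expects.
std::string targetTriple(GPUArch Arch, GPURuntime Runtime) {
  switch (Arch) {
  case GPUArch::NVPTX64:
    switch (Runtime) {
    case GPURuntime::CUDA:
      return "nvptx64-nvidia-cuda"; // Assumed CUDA triple.
    case GPURuntime::OpenCL:
      return "nvptx64-nvidia-nvcl"; // Triple named in the review.
    }
    break;
  }
  return ""; // Not reached for valid enumerators.
}
```

Nesting the runtime switch inside the architecture switch leaves room for a later architecture to define its own set of triples.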

singam-sanjay added inline comments. Apr 29 2017, 11:25 PM

lib/CodeGen/PPCGCodeGeneration.cpp
1623 ↗	(On Diff #97189)	Thank you ! That was helpful.

PhilippSchaad added a reviewer: singam-sanjay. Apr 30 2017, 3:42 AM

Integrated D32226 - Managed memory support

@grosser @Meinersbur ping

Fixed formatting and managed-memory test case (including pre-existing bug)

I only consider the clSetKernelArg handling as a remaining bigger issue. Having only "polly_"-prefixed functions non-static would also be great.

Tobias is the sole author of Polly-ACC. I think he should give the final LGTM.

include/polly/LinkAllPasses.h
51 ↗	(On Diff #96620)	I assume you kept the arguments of type `int` to not include header files here.
lib/CodeGen/PPCGCodeGeneration.cpp
769–772 ↗	(On Diff #97436)	Can you make this a switch so you get warned by the compiler when adding more runtimes?
1726–1729 ↗	(On Diff #97436)	Could also be a switch.
2689 ↗	(On Diff #97436)	We should check whether the arguments are valid. Such as: switch (Runtime) { case 1: case 2: ... default: llvm_unreachable("Invalid argument for Runtime"); }
2691–2695 ↗	(On Diff #97436)	Similarly: switch (Arch) { case 1: ... default: llvm_unreachable("Invalid argument for Arch"); }
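
The argument check suggested above might look roughly like this. To keep the sketch free of LLVM headers, `unreachable` is a local stand-in for `llvm_unreachable`, and `decodeRuntime` is a hypothetical helper. The integer codes (1 for CUDA, 2 for OpenCL) are assumed from the example given later in the same review round.

```cpp
#include <cstdio>
#include <cstdlib>

enum class GPURuntime { CUDA, OpenCL };

// Local stand-in for llvm_unreachable: report and abort.
[[noreturn]] void unreachable(const char *Message) {
  std::fprintf(stderr, "UNREACHABLE: %s\n", Message);
  std::abort();
}

// Hypothetical helper: validate an int-encoded runtime argument.
GPURuntime decodeRuntime(int Runtime) {
  switch (Runtime) {
  case 1:
    return GPURuntime::CUDA;
  case 2:
    return GPURuntime::OpenCL;
  default:
    unreachable("Invalid argument for Runtime");
  }
}
```

One trade-off worth noting: when switching directly over an enum, leaving out the `default` label lets compilers warn about unhandled enumerators, while a `default` label like the one above catches invalid values at run time instead.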
lib/Support/RegisterPasses.cpp
331–348 ↗	(On Diff #97436)	
int Arch;
switch (GPUArch) {
case GPU_ARCH_NVPTX64: Arch = 1; break;
}
int Runtime;
switch (GPURuntime) {
case GPU_RUNTIME_CUDA: Runtime = 1; break;
case GPU_RUNTIME_OPENCL: Runtime = 2; break;
}
PM.add(polly::createPPCGCodeGenerationPass(Arch, Runtime));
With "you don't need a switch anymore", I was thinking about the likes of: PM.add(polly::createPPCGCodeGenerationPass(1, GPURuntime + 1)); Your choice. It could be helpful to have `createPPCGCodeGenerationPass` accept the GPURuntime enum as arguments instead. In your solution, I don't see the use of `static const int` local variables. If you want identifiers that give names to the accepted arguments, declare them as `#define` or `static const int` in the header file that also declares `createPPCGCodeGenerationPass`, so these can be used in the implementation of `createPPCGCodeGenerationPass` as well.
tools/GPURuntime/GPUJIT.c
397–401 ↗	(On Diff #97436)	Did you consider introducing a new function for this sequence of code? It appears quite often.
610–618 ↗	(On Diff #97436)	Trying each argument size after the other and hoping one matches is not good. The caller must know the argument sizes. You probably have to pass the sizes in another argument to `launchKernelCL` that contains those sizes for each argument, generated by Polly. Without this, the code will fail if you pass a struct (or vector) of size other than 8, 4, 2, or 1.
676 ↗	(On Diff #97436)	Shouldn't these print to `stderr`?
750 ↗	(On Diff #97436)	The function name does not follow the naming of other functions in this file. In C it is common to have the public API functions prefixed with the library name (here: "polly") and everything else static. Don't choose the prefix of another library (here: "cl_"). This avoids symbol conflicts that arise when multiple libraries happen to give the same name to a function.
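
The point about argument sizes can be sketched as follows: instead of probing sizes 8, 4, 2 and 1 until one is accepted, the caller passes a parallel array of sizes. Everything here is hypothetical and only mimics the shape of `clSetKernelArg(kernel, arg_index, arg_size, arg_value)`. `MockKernel` records the arguments instead of talking to an OpenCL driver, and `setKernelArgs` is not the signature used by GPUJIT.

```cpp
#include <cstddef>
#include <vector>

// One recorded kernel argument.
struct RecordedArg {
  unsigned Index;
  std::size_t Size;
  const void *Value;
};

// Hypothetical mock of a kernel object; setArg mimics the parameter
// order of clSetKernelArg and returns 0 on success.
struct MockKernel {
  std::vector<RecordedArg> Args;

  int setArg(unsigned Index, std::size_t Size, const void *Value) {
    if (Size == 0 || Value == nullptr)
      return -1; // Reject obviously invalid arguments.
    Args.push_back({Index, Size, Value});
    return 0;
  }
};

// Set each argument exactly once, with the size supplied by the caller
// (in Polly's case this array would be generated by the compiler).
int setKernelArgs(MockKernel &Kernel, void **Parameters,
                  const std::size_t *Sizes, unsigned NumArgs) {
  for (unsigned I = 0; I < NumArgs; ++I)
    if (int Err = Kernel.setArg(I, Sizes[I], Parameters[I]))
      return Err;
  return 0;
}
```

With explicit sizes, an argument such as a 12-byte struct is passed correctly, which the guess-and-retry approach cannot do for sizes other than 8, 4, 2 or 1.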

PhilippSchaad marked 17 inline comments as done. May 2 2017, 1:13 PM

PhilippSchaad added inline comments.

include/polly/LinkAllPasses.h
51 ↗	(On Diff #96620)	That is correct, yes.
lib/Support/RegisterPasses.cpp
331–348 ↗	(On Diff #97436)	
> With "you don't need a switch anymore", I was thinking about the likes of: PM.add(polly::createPPCGCodeGenerationPass(1, GPURuntime + 1)); Your choice.
Personally, I think the current solution is a tiny bit more 'documenting'. Both are fine though, good call.
> It could be helpful to have createPPCGCodeGenerationPass accept the GPURuntime enum as arguments instead.
That is true, but it would mean having to provide a header with that GPURuntime enum, right?
> In your solution, I don't see the use of static const int local variables. If you want identifiers that give names to the accepted arguments, declare them as #define or static const int in the header file that also declares createPPCGCodeGenerationPass, so these can be used in the implementation of createPPCGCodeGenerationPass as well.
Would it maybe make sense to introduce a PPCGCodeGeneration header at this point?
tools/GPURuntime/GPUJIT.c
610–618 ↗	(On Diff #97436)	Yes, this is a priority issue still. The issue will have to be resolved at some point. This is basically a temporary way around some (probably) major argument handling changes in PPCG etc.

Addressed multiple issues pointed out in comment

Fixed formatting

You changed stdout to stderr everywhere, which is better in my point of view, but logically is a different change. Sorry that I didn't realize that the libcudart also printed to stdout before so you tried to be consistent. Could you commit that change separately beforehand? (Maybe also the change in argument name capitalization)

I am accepting the patch provided that you are going to improve the clSetKernelArg situation later and add a TODO to the code about it. The other stuff is of a stylistic nature only.

Please also wait for Tobias' approval.

lib/Support/RegisterPasses.cpp
331–348 ↗	(On Diff #97436)	Please remove at least the `static` keyword. It makes sense for global constants, but not for function-local ones. The style static const int ArgumentName = 0; func(ArgumentName); is rather unusual in LLVM-style code (but not bad if applied consistently, which is unfortunately not the case in Polly). I've seen func(/* ArgumentName = */ 0); much more often. In this case I think `UseOpenCLRuntime`, `UseCUDARuntime` and `TargetNVPTX64` should really be global constants declared close to the declaration of `createPPCGCodeGenerationPass` so they can be used by every caller of that function.
> Would it maybe make sense to introduce a PPCGCodeGeneration header at this point?
Yes, that sounds good to me as well.
tools/GPURuntime/GPUJIT.c
606–607 ↗	(On Diff #97492)	Thanks for the introduction of `checkOpenCLError`. You could also introduce one for these two lines. for instance: if (!GlobalContext) handleError("GPGPU-code generation not correctly initialized.\n"); `handleError` could also be called by `checkOpenCLError`. It helps centralising the error handling, such that if we change some detail about it (e.g. the return code on exit, or some cleanup code), there is a single function for that.
958–963 ↗	(On Diff #97492)	These are unrelated changes Tobias usually complains about. I personally don't care.
1098 ↗	(On Diff #97492)	Unrelated whitespace change?
610–618 ↗	(On Diff #97436)	Also note that I am not sure that OpenCL ICD's are required to check for correct `CL_INVALID_ARG_SIZE`. It might just trust the caller, or be a badly written one.
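
The centralized error handling proposed above could be sketched like this. `handleError` and the "not correctly initialized" message come from the review comment. `checkError` is modelled on the `checkOpenCLError` helper mentioned there but takes a plain `int`, with 0 assumed to mean success (as `CL_SUCCESS` does in OpenCL). `requireContext` is a hypothetical name, and the exit code is an assumption.

```cpp
#include <cstdio>
#include <cstdlib>

// The single place that decides how a fatal runtime error is reported
// and which exit code is used (the value here is an assumption).
[[noreturn]] void handleError(const char *Message) {
  std::fprintf(stderr, "%s", Message);
  std::exit(-1);
}

// Modelled on checkOpenCLError: forwards to handleError on failure.
void checkError(int ErrorCode, const char *Message) {
  if (ErrorCode != 0)
    handleError(Message);
}

// Hypothetical helper for the initialization check quoted in the review.
void requireContext(const void *GlobalContext) {
  if (!GlobalContext)
    handleError("GPGPU-code generation not correctly initialized.\n");
}
```

Because both checks end in `handleError`, a later change to the exit code or to cleanup logic only has to be made in one function.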

This revision is now accepted and ready to land. May 4 2017, 2:41 AM

Addressed most of your concerns. @grosser it should be ready now, what do you think?

Introduced PPCGCodeGeneration header file for simplicity

Harbormaster completed remote builds in B6145: Diff 97804. May 4 2017, 3:45 AM

@Meinersbur the unrelated changes you mentioned have been added/moved to D32852 and D32854.

grosser accepted this revision. May 4 2017, 3:52 AM

grosser added inline comments.

test/GPGPU/cuda-managed-memory-simple.ll
49 ↗	(On Diff #97804)	This change is unrelated.

PhilippSchaad added inline comments. May 4 2017, 3:55 AM

test/GPGPU/cuda-managed-memory-simple.ll
49 ↗	(On Diff #97804)	It is, but it got fixed in the meantime anyway. Removing it.

Closed by commit rL302215: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen (authored by bollu). May 5 2017, 1:08 AM

This revision was automatically updated to reflect the committed changes.

bollu mentioned this in rL302217: Revert "[Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen". May 5 2017, 2:15 AM

Reopened for rebase

This revision is now accepted and ready to land. May 7 2017, 2:36 AM

Rebase

Closed by commit rL302379: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen (authored by bollu). May 7 2017, 2:17 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
polly/trunk/CMakeLists.txt	10 lines
polly/trunk/include/polly/CodeGen/PPCGCodeGeneration.h	24 lines
polly/trunk/include/polly/LinkAllPasses.h	4 lines
polly/trunk/lib/CodeGen/PPCGCodeGeneration.cpp	113 lines
polly/trunk/lib/Support/RegisterPasses.cpp	21 lines
polly/trunk/test/GPGPU/cuda-managed-memory-simple.ll	4 lines
polly/trunk/test/GPGPU/size-cast.ll	2 lines
polly/trunk/tools/CMakeLists.txt	4 lines
polly/trunk/tools/GPURuntime/GPUJIT.h	19 lines
polly/trunk/tools/GPURuntime/GPUJIT.c	1317 lines

Diff 98109

polly/trunk/CMakeLists.txt

	Show First 20 Lines • Show All 146 Lines • ▼ Show 20 Lines

	# Add path for custom modules			# Add path for custom modules
	set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${POLLY_SOURCE_DIR}/cmake")			set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${POLLY_SOURCE_DIR}/cmake")

	SET(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)			SET(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)

	option(POLLY_ENABLE_GPGPU_CODEGEN "Enable GPGPU code generation feature" OFF)			option(POLLY_ENABLE_GPGPU_CODEGEN "Enable GPGPU code generation feature" OFF)
	if (POLLY_ENABLE_GPGPU_CODEGEN)			if (POLLY_ENABLE_GPGPU_CODEGEN)
	# Do not require CUDA, as GPU code generation test cases can be run without			# Do not require CUDA/OpenCL, as GPU code generation test cases can be run
	# a cuda library.			# without a CUDA/OpenCL library.
	FIND_PACKAGE(CUDA)			FIND_PACKAGE(CUDA)
				FIND_PACKAGE(OpenCL)
	set(GPU_CODEGEN TRUE)			set(GPU_CODEGEN TRUE)
	else(POLLY_ENABLE_GPGPU_CODEGEN)			else(POLLY_ENABLE_GPGPU_CODEGEN)
	set(GPU_CODEGEN FALSE)			set(GPU_CODEGEN FALSE)
	endif(POLLY_ENABLE_GPGPU_CODEGEN)			endif(POLLY_ENABLE_GPGPU_CODEGEN)


	# Support GPGPU code generation if the library is available.			# Support GPGPU code generation if the library is available.
	if (CUDALIB_FOUND)			if (CUDALIB_FOUND)
				add_definitions(-DHAS_LIBCUDART)
	INCLUDE_DIRECTORIES( ${CUDALIB_INCLUDE_DIR} )			INCLUDE_DIRECTORIES( ${CUDALIB_INCLUDE_DIR} )
	endif(CUDALIB_FOUND)			endif(CUDALIB_FOUND)
				if (OpenCL_FOUND)
				add_definitions(-DHAS_LIBOPENCL)
				INCLUDE_DIRECTORIES( ${OpenCL_INCLUDE_DIR} )
				endif(OpenCL_FOUND)

	option(POLLY_BUNDLED_ISL "Use the bundled version of libisl included in Polly" ON)			option(POLLY_BUNDLED_ISL "Use the bundled version of libisl included in Polly" ON)
	if (NOT POLLY_BUNDLED_ISL)			if (NOT POLLY_BUNDLED_ISL)
	find_package(ISL MODULE REQUIRED)			find_package(ISL MODULE REQUIRED)
	message(STATUS "Using external libisl ${ISL_VERSION} in: ${ISL_PREFIX}")			message(STATUS "Using external libisl ${ISL_VERSION} in: ${ISL_PREFIX}")
	set(ISL_TARGET ISL)			set(ISL_TARGET ISL)
	else()			else()
	set(ISL_INCLUDE_DIRS			set(ISL_INCLUDE_DIRS
	▲ Show 20 Lines • Show All 95 Lines • Show Last 20 Lines

polly/trunk/include/polly/CodeGen/PPCGCodeGeneration.h

				//===--- polly/PPCGCodeGeneration.h - Polly Accelerator Code Generation. --===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// Take a scop created by ScopInfo and map it to GPU code using the ppcg
				// GPU mapping strategy.
				//
				//===----------------------------------------------------------------------===//

				#ifndef POLLY_PPCGCODEGENERATION_H
				#define POLLY_PPCGCODEGENERATION_H

				/// The GPU Architecture to target.
				enum GPUArch { NVPTX64 };

				/// The GPU Runtime implementation to use.
				enum GPURuntime { CUDA, OpenCL };

				#endif // POLLY_PPCGCODEGENERATION_H

polly/trunk/include/polly/LinkAllPasses.h

	Show All 9 Lines
	// This header file pulls in all transformation and analysis passes for tools			// This header file pulls in all transformation and analysis passes for tools
	// like opt and bugpoint that need this functionality.			// like opt and bugpoint that need this functionality.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef POLLY_LINKALLPASSES_H			#ifndef POLLY_LINKALLPASSES_H
	#define POLLY_LINKALLPASSES_H			#define POLLY_LINKALLPASSES_H

				#include "polly/CodeGen/PPCGCodeGeneration.h"
	#include "polly/Config/config.h"			#include "polly/Config/config.h"
	#include "polly/PruneUnprofitable.h"			#include "polly/PruneUnprofitable.h"
	#include "polly/Simplify.h"			#include "polly/Simplify.h"
	#include "polly/Support/DumpModulePass.h"			#include "polly/Support/DumpModulePass.h"
	#include "llvm/ADT/StringRef.h"			#include "llvm/ADT/StringRef.h"
	#include <cstdlib>			#include <cstdlib>

	namespace llvm {			namespace llvm {
	Show All 17 Lines
	llvm::Pass *createPollyCanonicalizePass();			llvm::Pass *createPollyCanonicalizePass();
	llvm::Pass *createPolyhedralInfoPass();			llvm::Pass *createPolyhedralInfoPass();
	llvm::Pass *createScopDetectionPass();			llvm::Pass *createScopDetectionPass();
	llvm::Pass *createScopInfoRegionPassPass();			llvm::Pass *createScopInfoRegionPassPass();
	llvm::Pass *createScopInfoWrapperPassPass();			llvm::Pass *createScopInfoWrapperPassPass();
	llvm::Pass *createIslAstInfoPass();			llvm::Pass *createIslAstInfoPass();
	llvm::Pass *createCodeGenerationPass();			llvm::Pass *createCodeGenerationPass();
	#ifdef GPU_CODEGEN			#ifdef GPU_CODEGEN
	llvm::Pass *createPPCGCodeGenerationPass();			llvm::Pass *createPPCGCodeGenerationPass(GPUArch Arch = GPUArch::NVPTX64,
				GPURuntime Runtime = GPURuntime::CUDA);
	#endif			#endif
	llvm::Pass *createIslScheduleOptimizerPass();			llvm::Pass *createIslScheduleOptimizerPass();
	llvm::Pass *createFlattenSchedulePass();			llvm::Pass *createFlattenSchedulePass();
	llvm::Pass *createDeLICMPass();			llvm::Pass *createDeLICMPass();

	extern char &CodePreparationID;			extern char &CodePreparationID;
	} // namespace polly			} // namespace polly

	▲ Show 20 Lines • Show All 56 Lines • Show Last 20 Lines

polly/trunk/lib/CodeGen/PPCGCodeGeneration.cpp

//===------ PPCGCodeGeneration.cpp - Polly Accelerator Code Generation. ---===//		//===------ PPCGCodeGeneration.cpp - Polly Accelerator Code Generation. ---===//
//		//
// The LLVM Compiler Infrastructure		// The LLVM Compiler Infrastructure
//		//
// This file is distributed under the University of Illinois Open Source		// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.		// License. See LICENSE.TXT for details.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// Take a scop created by ScopInfo and map it to GPU code using the ppcg		// Take a scop created by ScopInfo and map it to GPU code using the ppcg
// GPU mapping strategy.		// GPU mapping strategy.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

		#include "polly/CodeGen/PPCGCodeGeneration.h"
#include "polly/CodeGen/IslAst.h"		#include "polly/CodeGen/IslAst.h"
#include "polly/CodeGen/IslNodeBuilder.h"		#include "polly/CodeGen/IslNodeBuilder.h"
#include "polly/CodeGen/Utils.h"		#include "polly/CodeGen/Utils.h"
#include "polly/DependenceInfo.h"		#include "polly/DependenceInfo.h"
#include "polly/LinkAllPasses.h"		#include "polly/LinkAllPasses.h"
#include "polly/Options.h"		#include "polly/Options.h"
#include "polly/ScopDetection.h"		#include "polly/ScopDetection.h"
#include "polly/ScopInfo.h"		#include "polly/ScopInfo.h"
▲ Show 20 Lines • Show All 125 Lines • ▼ Show 20 Lines
/// for generating GPU specific user nodes.		/// for generating GPU specific user nodes.
///		///
/// @see GPUNodeBuilder::createUser		/// @see GPUNodeBuilder::createUser
class GPUNodeBuilder : public IslNodeBuilder {		class GPUNodeBuilder : public IslNodeBuilder {
public:		public:
GPUNodeBuilder(PollyIRBuilder &Builder, ScopAnnotator &Annotator,		GPUNodeBuilder(PollyIRBuilder &Builder, ScopAnnotator &Annotator,
const DataLayout &DL, LoopInfo &LI, ScalarEvolution &SE,		const DataLayout &DL, LoopInfo &LI, ScalarEvolution &SE,
DominatorTree &DT, Scop &S, BasicBlock *StartBlock,		DominatorTree &DT, Scop &S, BasicBlock *StartBlock,
gpu_prog *Prog)		gpu_prog *Prog, GPURuntime Runtime, GPUArch Arch)
: IslNodeBuilder(Builder, Annotator, DL, LI, SE, DT, S, StartBlock),		: IslNodeBuilder(Builder, Annotator, DL, LI, SE, DT, S, StartBlock),
Prog(Prog) {		Prog(Prog), Runtime(Runtime), Arch(Arch) {
getExprBuilder().setIDToSAI(&IDToSAI);		getExprBuilder().setIDToSAI(&IDToSAI);
}		}

/// Create after-run-time-check initialization code.		/// Create after-run-time-check initialization code.
void initializeAfterRTH();		void initializeAfterRTH();

/// Finalize the generated scop.		/// Finalize the generated scop.
virtual void finalize();		virtual void finalize();
Show All 29 Lines	private:
/// A module containing GPU code.		/// A module containing GPU code.
///		///
/// This pointer is only set in case we are currently generating GPU code.		/// This pointer is only set in case we are currently generating GPU code.
std::unique_ptr<Module> GPUModule;		std::unique_ptr<Module> GPUModule;

/// The GPU program we generate code for.		/// The GPU program we generate code for.
gpu_prog *Prog;		gpu_prog *Prog;

		/// The GPU Runtime implementation to use (OpenCL or CUDA).
		GPURuntime Runtime;

		/// The GPU Architecture to target.
		GPUArch Arch;

/// Class to free isl_ids.		/// Class to free isl_ids.
class IslIdDeleter {		class IslIdDeleter {
public:		public:
void operator()(__isl_take isl_id *Id) { isl_id_free(Id); };		void operator()(__isl_take isl_id *Id) { isl_id_free(Id); };
};		};

/// A set containing all isl_ids allocated in a GPU kernel.		/// A set containing all isl_ids allocated in a GPU kernel.
///		///
▲ Show 20 Lines • Show All 535 Lines • ▼ Show 20 Lines	if (!F) {
FunctionType *Ty = FunctionType::get(Builder.getVoidTy(), false);		FunctionType *Ty = FunctionType::get(Builder.getVoidTy(), false);
F = Function::Create(Ty, Linkage, Name, M);		F = Function::Create(Ty, Linkage, Name, M);
}		}

Builder.CreateCall(F);		Builder.CreateCall(F);
}		}

Value *GPUNodeBuilder::createCallInitContext() {		Value *GPUNodeBuilder::createCallInitContext() {
const char *Name = "polly_initContext";		const char *Name;

		switch (Runtime) {
		case GPURuntime::CUDA:
		Name = "polly_initContextCUDA";
		break;
		case GPURuntime::OpenCL:
		Name = "polly_initContextCL";
		break;
		}

Module *M = Builder.GetInsertBlock()->getParent()->getParent();		Module *M = Builder.GetInsertBlock()->getParent()->getParent();
Function *F = M->getFunction(Name);		Function *F = M->getFunction(Name);

// If F is not available, declare it.		// If F is not available, declare it.
if (!F) {		if (!F) {
GlobalValue::LinkageTypes Linkage = Function::ExternalLinkage;		GlobalValue::LinkageTypes Linkage = Function::ExternalLinkage;
std::vector<Type *> Args;		std::vector<Type *> Args;
FunctionType *Ty = FunctionType::get(Builder.getInt8PtrTy(), Args, false);		FunctionType *Ty = FunctionType::get(Builder.getInt8PtrTy(), Args, false);
▲ Show 20 Lines • Show All 259 Lines • ▼ Show 20 Lines	void GPUNodeBuilder::createScopStmt(isl_ast_expr *Expr,
if (Stmt->isBlockStmt())		if (Stmt->isBlockStmt())
BlockGen.copyStmt(*Stmt, LTS, Indexes);		BlockGen.copyStmt(*Stmt, LTS, Indexes);
else		else
RegionGen.copyStmt(*Stmt, LTS, Indexes);		RegionGen.copyStmt(*Stmt, LTS, Indexes);
}		}

void GPUNodeBuilder::createKernelSync() {		void GPUNodeBuilder::createKernelSync() {
Module *M = Builder.GetInsertBlock()->getParent()->getParent();		Module *M = Builder.GetInsertBlock()->getParent()->getParent();
auto *Sync = Intrinsic::getDeclaration(M, Intrinsic::nvvm_barrier0);
		Function *Sync;

		switch (Arch) {
		case GPUArch::NVPTX64:
		Sync = Intrinsic::getDeclaration(M, Intrinsic::nvvm_barrier0);
		break;
		}

Builder.CreateCall(Sync, {});		Builder.CreateCall(Sync, {});
}		}

/// Collect llvm::Values referenced from @p Node		/// Collect llvm::Values referenced from @p Node
///		///
/// This function only applies to isl_ast_nodes that are user_nodes referring		/// This function only applies to isl_ast_nodes that are user_nodes referring
/// to a ScopStmt. All other node types are ignored.		/// to a ScopStmt. All other node types are ignored.
///		///
▲ Show 20 Lines • Show All 389 Lines • ▼ Show 20 Lines	GPUNodeBuilder::createKernelFunctionDecl(ppcg_kernel *Kernel,
}		}

for (auto *V : SubtreeValues)		for (auto *V : SubtreeValues)
Args.push_back(V->getType());		Args.push_back(V->getType());

auto *FT = FunctionType::get(Builder.getVoidTy(), Args, false);		auto *FT = FunctionType::get(Builder.getVoidTy(), Args, false);
auto *FN = Function::Create(FT, Function::ExternalLinkage, Identifier,		auto *FN = Function::Create(FT, Function::ExternalLinkage, Identifier,
GPUModule.get());		GPUModule.get());

		switch (Arch) {
		case GPUArch::NVPTX64:
FN->setCallingConv(CallingConv::PTX_Kernel);		FN->setCallingConv(CallingConv::PTX_Kernel);
		break;
		}

auto Arg = FN->arg_begin();		auto Arg = FN->arg_begin();
for (long i = 0; i < Kernel->n_array; i++) {		for (long i = 0; i < Kernel->n_array; i++) {
if (!ppcg_kernel_requires_array_argument(Kernel, i))		if (!ppcg_kernel_requires_array_argument(Kernel, i))
continue;		continue;

Arg->setName(Kernel->array[i].array->name);		Arg->setName(Kernel->array[i].array->name);

▲ Show 20 Lines • Show All 44 Lines • ▼ Show 20 Lines	for (auto *V : SubtreeValues) {
ValueMap[V] = &*Arg;		ValueMap[V] = &*Arg;
Arg++;		Arg++;
}		}

return FN;		return FN;
}		}

void GPUNodeBuilder::insertKernelIntrinsics(ppcg_kernel *Kernel) {		void GPUNodeBuilder::insertKernelIntrinsics(ppcg_kernel *Kernel) {
Intrinsic::ID IntrinsicsBID[] = {Intrinsic::nvvm_read_ptx_sreg_ctaid_x,		Intrinsic::ID IntrinsicsBID[2];
Intrinsic::nvvm_read_ptx_sreg_ctaid_y};		Intrinsic::ID IntrinsicsTID[3];

Intrinsic::ID IntrinsicsTID[] = {Intrinsic::nvvm_read_ptx_sreg_tid_x,		switch (Arch) {
Intrinsic::nvvm_read_ptx_sreg_tid_y,		case GPUArch::NVPTX64:
Intrinsic::nvvm_read_ptx_sreg_tid_z};		IntrinsicsBID[0] = Intrinsic::nvvm_read_ptx_sreg_ctaid_x;
		IntrinsicsBID[1] = Intrinsic::nvvm_read_ptx_sreg_ctaid_y;

		IntrinsicsTID[0] = Intrinsic::nvvm_read_ptx_sreg_tid_x;
		IntrinsicsTID[1] = Intrinsic::nvvm_read_ptx_sreg_tid_y;
		IntrinsicsTID[2] = Intrinsic::nvvm_read_ptx_sreg_tid_z;
		break;
		}

auto addId = [this](__isl_take isl_id *Id, Intrinsic::ID Intr) mutable {		auto addId = [this](__isl_take isl_id *Id, Intrinsic::ID Intr) mutable {
std::string Name = isl_id_get_name(Id);		std::string Name = isl_id_get_name(Id);
Module *M = Builder.GetInsertBlock()->getParent()->getParent();		Module *M = Builder.GetInsertBlock()->getParent()->getParent();
Function *IntrinsicFn = Intrinsic::getDeclaration(M, Intr);		Function *IntrinsicFn = Intrinsic::getDeclaration(M, Intr);
Value *Val = Builder.CreateCall(IntrinsicFn, {});		Value *Val = Builder.CreateCall(IntrinsicFn, {});
Val = Builder.CreateIntCast(Val, Builder.getInt64Ty(), false, Name);		Val = Builder.CreateIntCast(Val, Builder.getInt64Ty(), false, Name);
IDToValue[Id] = Val;		IDToValue[Id] = Val;
▲ Show 20 Lines • Show All 132 Lines • ▼ Show 20 Lines	for (int i = 0; i < Kernel->n_var; ++i) {
LocalArrays.push_back(Allocation);		LocalArrays.push_back(Allocation);
KernelIds.push_back(Id);		KernelIds.push_back(Id);
IDToSAI[Id] = SAI;		IDToSAI[Id] = SAI;
}		}
}		}

void GPUNodeBuilder::createKernelFunction(ppcg_kernel *Kernel,		void GPUNodeBuilder::createKernelFunction(ppcg_kernel *Kernel,
SetVector<Value *> &SubtreeValues) {		SetVector<Value *> &SubtreeValues) {

std::string Identifier = "kernel_" + std::to_string(Kernel->id);		std::string Identifier = "kernel_" + std::to_string(Kernel->id);
GPUModule.reset(new Module(Identifier, Builder.getContext()));		GPUModule.reset(new Module(Identifier, Builder.getContext()));

		switch (Arch) {
		case GPUArch::NVPTX64:
		if (Runtime == GPURuntime::CUDA)
GPUModule->setTargetTriple(Triple::normalize("nvptx64-nvidia-cuda"));		GPUModule->setTargetTriple(Triple::normalize("nvptx64-nvidia-cuda"));
		else if (Runtime == GPURuntime::OpenCL)
		GPUModule->setTargetTriple(Triple::normalize("nvptx64-nvidia-nvcl"));
GPUModule->setDataLayout(computeNVPTXDataLayout(true /* is64Bit */));		GPUModule->setDataLayout(computeNVPTXDataLayout(true /* is64Bit */));
		break;
		}

Function *FN = createKernelFunctionDecl(Kernel, SubtreeValues);		Function *FN = createKernelFunctionDecl(Kernel, SubtreeValues);

BasicBlock *PrevBlock = Builder.GetInsertBlock();		BasicBlock *PrevBlock = Builder.GetInsertBlock();
auto EntryBlock = BasicBlock::Create(Builder.getContext(), "entry", FN);		auto EntryBlock = BasicBlock::Create(Builder.getContext(), "entry", FN);

DT.addNewBlock(EntryBlock, PrevBlock);		DT.addNewBlock(EntryBlock, PrevBlock);

Builder.SetInsertPoint(EntryBlock);		Builder.SetInsertPoint(EntryBlock);
Builder.CreateRetVoid();		Builder.CreateRetVoid();
Builder.SetInsertPoint(EntryBlock, EntryBlock->begin());		Builder.SetInsertPoint(EntryBlock, EntryBlock->begin());

ScopDetection::markFunctionAsInvalid(FN);		ScopDetection::markFunctionAsInvalid(FN);

prepareKernelArguments(Kernel, FN);		prepareKernelArguments(Kernel, FN);
createKernelVariables(Kernel, FN);		createKernelVariables(Kernel, FN);
insertKernelIntrinsics(Kernel);		insertKernelIntrinsics(Kernel);
}		}

std::string GPUNodeBuilder::createKernelASM() {		std::string GPUNodeBuilder::createKernelASM() {
llvm::Triple GPUTriple(Triple::normalize("nvptx64-nvidia-cuda"));		llvm::Triple GPUTriple;

		switch (Arch) {
		case GPUArch::NVPTX64:
		switch (Runtime) {
		case GPURuntime::CUDA:
		GPUTriple = llvm::Triple(Triple::normalize("nvptx64-nvidia-cuda"));
		break;
		case GPURuntime::OpenCL:
		GPUTriple = llvm::Triple(Triple::normalize("nvptx64-nvidia-nvcl"));
		break;
		}
		break;
		}

std::string ErrMsg;		std::string ErrMsg;
auto GPUTarget = TargetRegistry::lookupTarget(GPUTriple.getTriple(), ErrMsg);		auto GPUTarget = TargetRegistry::lookupTarget(GPUTriple.getTriple(), ErrMsg);

if (!GPUTarget) {		if (!GPUTarget) {
errs() << ErrMsg << "\n";		errs() << ErrMsg << "\n";
return "";		return "";
}		}

TargetOptions Options;		TargetOptions Options;
Options.UnsafeFPMath = FastMath;		Options.UnsafeFPMath = FastMath;
std::unique_ptr<TargetMachine> TargetM(
GPUTarget->createTargetMachine(GPUTriple.getTriple(), CudaVersion, "",		std::string subtarget;
Options, Optional<Reloc::Model>()));
		switch (Arch) {
		case GPUArch::NVPTX64:
		subtarget = CudaVersion;
		break;
		}

		std::unique_ptr<TargetMachine> TargetM(GPUTarget->createTargetMachine(
		GPUTriple.getTriple(), subtarget, "", Options, Optional<Reloc::Model>()));

SmallString<0> ASMString;		SmallString<0> ASMString;
raw_svector_ostream ASMStream(ASMString);		raw_svector_ostream ASMStream(ASMString);
llvm::legacy::PassManager PM;		llvm::legacy::PassManager PM;

PM.add(createTargetTransformInfoWrapperPass(TargetM->getTargetIRAnalysis()));		PM.add(createTargetTransformInfoWrapperPass(TargetM->getTargetIRAnalysis()));

if (TargetM->addPassesToEmitFile(		if (TargetM->addPassesToEmitFile(
Show All 35 Lines	std::string GPUNodeBuilder::finalizeKernelFunction() {
return Assembly;		return Assembly;
}		}

namespace {		namespace {
class PPCGCodeGeneration : public ScopPass {		class PPCGCodeGeneration : public ScopPass {
public:		public:
static char ID;		static char ID;

		GPURuntime Runtime = GPURuntime::CUDA;

		GPUArch Architecture = GPUArch::NVPTX64;

/// The scop that is currently processed.		/// The scop that is currently processed.
Scop *S;		Scop *S;

LoopInfo *LI;		LoopInfo *LI;
DominatorTree *DT;		DominatorTree *DT;
ScalarEvolution *SE;		ScalarEvolution *SE;
const DataLayout *DL;		const DataLayout *DL;
RegionInfo *RI;		RegionInfo *RI;
▲ Show 20 Lines • Show All 767 Lines • ▼ Show 20 Lines	void generateCode(__isl_take isl_ast_node *Root, gpu_prog *Prog) {
// branch will guard the original scop from new induction variables that		// branch will guard the original scop from new induction variables that
// the SCEVExpander may introduce while code generating the parameters and		// the SCEVExpander may introduce while code generating the parameters and
// which may introduce scalar dependences that prevent us from correctly		// which may introduce scalar dependences that prevent us from correctly
// code generating this scop.		// code generating this scop.
BasicBlock *StartBlock =		BasicBlock *StartBlock =
executeScopConditionally(S, Builder.getTrue(), DT, RI, LI);		executeScopConditionally(S, Builder.getTrue(), DT, RI, LI);

GPUNodeBuilder NodeBuilder(Builder, Annotator, DL, LI, SE, DT, *S,		GPUNodeBuilder NodeBuilder(Builder, Annotator, DL, LI, SE, DT, *S,
StartBlock, Prog);		StartBlock, Prog, Runtime, Architecture);

// TODO: Handle LICM		// TODO: Handle LICM
auto SplitBlock = StartBlock->getSinglePredecessor();		auto SplitBlock = StartBlock->getSinglePredecessor();
Builder.SetInsertPoint(SplitBlock->getTerminator());		Builder.SetInsertPoint(SplitBlock->getTerminator());
NodeBuilder.addParameters(S->getContext());		NodeBuilder.addParameters(S->getContext());

isl_ast_build *Build = isl_ast_build_alloc(S->getIslCtx());		isl_ast_build *Build = isl_ast_build_alloc(S->getIslCtx());
isl_ast_expr *Condition = IslAst::buildRunCondition(S, Build);		isl_ast_expr *Condition = IslAst::buildRunCondition(S, Build);
▲ Show 20 Lines • Show All 71 Lines • ▼ Show 20 Lines	void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addPreserved<RegionInfoPass>();		AU.addPreserved<RegionInfoPass>();
AU.addPreserved<ScopInfoRegionPass>();		AU.addPreserved<ScopInfoRegionPass>();
}		}
};		};
} // namespace		} // namespace

char PPCGCodeGeneration::ID = 1;		char PPCGCodeGeneration::ID = 1;

Pass *polly::createPPCGCodeGenerationPass() { return new PPCGCodeGeneration(); }		Pass *polly::createPPCGCodeGenerationPass(GPUArch Arch, GPURuntime Runtime) {
		PPCGCodeGeneration *generator = new PPCGCodeGeneration();
		generator->Runtime = Runtime;
		generator->Architecture = Arch;
		return generator;
		}

INITIALIZE_PASS_BEGIN(PPCGCodeGeneration, "polly-codegen-ppcg",		INITIALIZE_PASS_BEGIN(PPCGCodeGeneration, "polly-codegen-ppcg",
"Polly - Apply PPCG translation to SCOP", false, false)		"Polly - Apply PPCG translation to SCOP", false, false)
INITIALIZE_PASS_DEPENDENCY(DependenceInfo);		INITIALIZE_PASS_DEPENDENCY(DependenceInfo);
INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass);		INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass);
INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass);		INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass);
INITIALIZE_PASS_DEPENDENCY(RegionInfoPass);		INITIALIZE_PASS_DEPENDENCY(RegionInfoPass);
INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass);		INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass);
INITIALIZE_PASS_DEPENDENCY(ScopDetection);		INITIALIZE_PASS_DEPENDENCY(ScopDetection);
INITIALIZE_PASS_END(PPCGCodeGeneration, "polly-codegen-ppcg",		INITIALIZE_PASS_END(PPCGCodeGeneration, "polly-codegen-ppcg",
"Polly - Apply PPCG translation to SCOP", false, false)		"Polly - Apply PPCG translation to SCOP", false, false)

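The runtime- and architecture-dependent choices made in PPCGCodeGeneration.cpp above can be summarised in a small standalone sketch. It relies only on the enumerator names visible in the diff (GPURuntime::CUDA, GPURuntime::OpenCL, GPUArch::NVPTX64). The helper names initContextName and kernelTriple are hypothetical and are not part of Polly, and the call to Triple::normalize that the real code applies to the triple string is left out so that the sketch builds without LLVM.

```cpp
#include <cassert>
#include <string>

// Mirrors of the enumerations used in the diff, redeclared here so the
// sketch compiles without LLVM or Polly headers.
enum class GPURuntime { CUDA, OpenCL };
enum class GPUArch { NVPTX64 };

// Hypothetical helper: name of the GPUJIT entry point that
// createCallInitContext() looks up for the selected runtime.
inline std::string initContextName(GPURuntime Runtime) {
  switch (Runtime) {
  case GPURuntime::CUDA:
    return "polly_initContextCUDA";
  case GPURuntime::OpenCL:
    return "polly_initContextCL";
  }
  assert(false && "unhandled GPURuntime");
  return "";
}

// Hypothetical helper: target triple string chosen in createKernelFunction()
// and createKernelASM() for a given architecture and runtime.
inline std::string kernelTriple(GPUArch Arch, GPURuntime Runtime) {
  switch (Arch) {
  case GPUArch::NVPTX64:
    return Runtime == GPURuntime::CUDA ? "nvptx64-nvidia-cuda"
                                       : "nvptx64-nvidia-nvcl";
  }
  assert(false && "unhandled GPUArch");
  return "";
}
```

Keeping both decisions in switch statements over the enums, as the patch does, means a newly added architecture or runtime shows up as an unhandled-enumerator warning at each of these sites.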
polly/trunk/lib/Support/RegisterPasses.cpp

// changed, but that the flag '-polly' provided at optimization level '-O3'		// changed, but that the flag '-polly' provided at optimization level '-O3'
// enables additional polyhedral optimizations.		// enables additional polyhedral optimizations.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "polly/RegisterPasses.h"		#include "polly/RegisterPasses.h"
#include "polly/Canonicalization.h"		#include "polly/Canonicalization.h"
#include "polly/CodeGen/CodeGeneration.h"		#include "polly/CodeGen/CodeGeneration.h"
#include "polly/CodeGen/CodegenCleanup.h"		#include "polly/CodeGen/CodegenCleanup.h"
		#include "polly/CodeGen/PPCGCodeGeneration.h"
#include "polly/DeLICM.h"		#include "polly/DeLICM.h"
#include "polly/DependenceInfo.h"		#include "polly/DependenceInfo.h"
#include "polly/FlattenSchedule.h"		#include "polly/FlattenSchedule.h"
#include "polly/LinkAllPasses.h"		#include "polly/LinkAllPasses.h"
#include "polly/Options.h"		#include "polly/Options.h"
#include "polly/PolyhedralInfo.h"		#include "polly/PolyhedralInfo.h"
#include "polly/ScopDetection.h"		#include "polly/ScopDetection.h"
#include "polly/ScopInfo.h"		#include "polly/ScopInfo.h"
▲ Show 20 Lines • Show All 62 Lines • ▼ Show 20 Lines	Target("polly-target", cl::desc("The hardware to target"),
cl::values(clEnumValN(TARGET_CPU, "cpu", "generate CPU code")		cl::values(clEnumValN(TARGET_CPU, "cpu", "generate CPU code")
#ifdef GPU_CODEGEN		#ifdef GPU_CODEGEN
,		,
clEnumValN(TARGET_GPU, "gpu", "generate GPU code")		clEnumValN(TARGET_GPU, "gpu", "generate GPU code")
#endif		#endif
),		),
cl::init(TARGET_CPU), cl::ZeroOrMore, cl::cat(PollyCategory));		cl::init(TARGET_CPU), cl::ZeroOrMore, cl::cat(PollyCategory));

		#ifdef GPU_CODEGEN
		static cl::opt<GPURuntime> GPURuntimeChoice(
		"polly-gpu-runtime", cl::desc("The GPU Runtime API to target"),
		cl::values(clEnumValN(GPURuntime::CUDA, "libcudart",
		"use the CUDA Runtime API"),
		clEnumValN(GPURuntime::OpenCL, "libopencl",
		"use the OpenCL Runtime API")),
		cl::init(GPURuntime::CUDA), cl::ZeroOrMore, cl::cat(PollyCategory));

		static cl::opt<GPUArch>
		GPUArchChoice("polly-gpu-arch", cl::desc("The GPU Architecture to target"),
		cl::values(clEnumValN(GPUArch::NVPTX64, "nvptx64",
		"target NVIDIA 64-bit architecture")),
		cl::init(GPUArch::NVPTX64), cl::ZeroOrMore,
		cl::cat(PollyCategory));
		#endif

VectorizerChoice polly::PollyVectorizerChoice;		VectorizerChoice polly::PollyVectorizerChoice;
static cl::opt<polly::VectorizerChoice, true> Vectorizer(		static cl::opt<polly::VectorizerChoice, true> Vectorizer(
"polly-vectorizer", cl::desc("Select the vectorization strategy"),		"polly-vectorizer", cl::desc("Select the vectorization strategy"),
cl::values(		cl::values(
clEnumValN(polly::VECTORIZER_NONE, "none", "No Vectorization"),		clEnumValN(polly::VECTORIZER_NONE, "none", "No Vectorization"),
clEnumValN(polly::VECTORIZER_POLLY, "polly",		clEnumValN(polly::VECTORIZER_POLLY, "polly",
"Polly internal vectorizer"),		"Polly internal vectorizer"),
clEnumValN(		clEnumValN(
▲ Show 20 Lines • Show All 192 Lines • ▼ Show 20 Lines	if (Target == TARGET_GPU) {
}		}
}		}

if (ExportJScop)		if (ExportJScop)
PM.add(polly::createJSONExporterPass());		PM.add(polly::createJSONExporterPass());

if (Target == TARGET_GPU) {		if (Target == TARGET_GPU) {
#ifdef GPU_CODEGEN		#ifdef GPU_CODEGEN
PM.add(polly::createPPCGCodeGenerationPass());		PM.add(
		polly::createPPCGCodeGenerationPass(GPUArchChoice, GPURuntimeChoice));
#endif		#endif
} else {		} else {
switch (CodeGeneration) {		switch (CodeGeneration) {
case CODEGEN_AST:		case CODEGEN_AST:
PM.add(polly::createIslAstInfoPass());		PM.add(polly::createIslAstInfoPass());
break;		break;
case CODEGEN_FULL:		case CODEGEN_FULL:
PM.add(polly::createCodeGenerationPass());		PM.add(polly::createCodeGenerationPass());
▲ Show 20 Lines • Show All 126 Lines • Show Last 20 Lines
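The two new options in RegisterPasses.cpp map a string given on the command line onto an enum value. The sketch below imitates that mapping for -polly-gpu-runtime without using llvm::cl. The function parseGPURuntimeFlag is a hypothetical stand-in and is not how cl::opt is implemented; only the accepted strings (libcudart, libopencl) are taken from the diff.

```cpp
#include <string>

// Mirror of the enumeration used in the diff, redeclared so the sketch
// compiles without Polly headers.
enum class GPURuntime { CUDA, OpenCL };

// Hypothetical stand-in for the cl::opt<GPURuntime> value parser: maps the
// string passed to -polly-gpu-runtime onto the enum. Returns false for an
// unknown value and leaves Result untouched.
inline bool parseGPURuntimeFlag(const std::string &Value, GPURuntime &Result) {
  if (Value == "libcudart") {
    Result = GPURuntime::CUDA;
    return true;
  }
  if (Value == "libopencl") {
    Result = GPURuntime::OpenCL;
    return true;
  }
  return false;
}
```

A caller would initialise Result to GPURuntime::CUDA before parsing, which corresponds to the cl::init(GPURuntime::CUDA) default in the patch.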

polly/trunk/test/GPGPU/cuda-managed-memory-simple.ll

	; }			; }
	;			;

	; CHECK-NOT: polly_copyFromHostToDevice			; CHECK-NOT: polly_copyFromHostToDevice
	; CHECK-NOT: polly_copyFromDeviceToHost			; CHECK-NOT: polly_copyFromDeviceToHost
	; CHECK-NOT: polly_freeDeviceMemory			; CHECK-NOT: polly_freeDeviceMemory
	; CHECK-NOT: polly_allocateMemoryForDevice			; CHECK-NOT: polly_allocateMemoryForDevice

	; CHECK: %13 = call i8* @polly_initContext()			; CHECK: %13 = call i8* @polly_initContextCUDA()
	; CHECK-NEXT: %14 = bitcast i32* %A to i8*			; CHECK-NEXT: %14 = bitcast i32* %A to i8*
	; CHECK-NEXT: %15 = getelementptr [2 x i8], [2 x i8]* %polly_launch_0_params, i64 0, i64 0			; CHECK-NEXT: %15 = getelementptr [2 x i8], [2 x i8]* %polly_launch_0_params, i64 0, i64 0
	; CHECK-NEXT: store i8* %14, i8** %polly_launch_0_param_0			; CHECK-NEXT: store i8* %14, i8** %polly_launch_0_param_0
	; CHECK-NEXT: %16 = bitcast i8** %polly_launch_0_param_0 to i8*			; CHECK-NEXT: %16 = bitcast i8** %polly_launch_0_param_0 to i8*
	; CHECK-NEXT: store i8* %16, i8** %15			; CHECK-NEXT: store i8* %16, i8** %15
	; CHECK-NEXT: %17 = bitcast i32* %R to i8*			; CHECK-NEXT: %17 = bitcast i32* %R to i8*
	; CHECK-NEXT: %18 = getelementptr [2 x i8], [2 x i8]* %polly_launch_0_params, i64 0, i64 1			; CHECK-NEXT: %18 = getelementptr [2 x i8], [2 x i8]* %polly_launch_0_params, i64 0, i64 1
	; CHECK-NEXT: store i8* %17, i8** %polly_launch_0_param_1			; CHECK-NEXT: store i8* %17, i8** %polly_launch_0_param_1
	; CHECK-NEXT: %19 = bitcast i8** %polly_launch_0_param_1 to i8*			; CHECK-NEXT: %19 = bitcast i8** %polly_launch_0_param_1 to i8*
	; CHECK-NEXT: store i8* %19, i8** %18			; CHECK-NEXT: store i8* %19, i8** %18
	; CHECK-NEXT: %20 = call i8* @polly_getKernel(i8* getelementptr inbounds ([750 x i8], [750 x i8]* @kernel_0, i32 0, i32 0), i8* getelementptr inbounds ([9 x i8], [9 x i8]* @kernel_0_name, i32 0, i32 0))			; CHECK-NEXT: %20 = call i8* @polly_getKernel(i8* getelementptr inbounds ([750 x i8], [750 x i8]* @kernel_0, i32 0, i32 0), i8* getelementptr inbounds ([9 x i8], [9 x i8]* @kernel_0_name, i32 0, i32 0))
	; CHECK-NEXT: call void @polly_launchKernel(i8* %20, i32 2, i32 1, i32 32, i32 1, i32 1, i8* %polly_launch_0_params_i8ptr)			; CHECK-NEXT: call void @polly_launchKernel(i8* %20, i32 2, i32 1, i32 32, i32 1, i32 1, i8* %polly_launch_0_params_i8ptr)
	; CHECK-NEXT: call void @polly_freeKernel(i8* %20)			; CHECK-NEXT: call void @polly_freeKernel(i8* %20)
	; CHECK-NEXT: call void @polly_synchronizeDevice()			; CHECK-NEXT: call void @polly_synchronizeDevice()
	; CHECK-NEXT: call void @polly_freeContext(i8* %13)			; CHECK-NEXT: call void @polly_freeContext(i8* %13)

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

	define void @copy(i32* %R, i32* %A) {			define void @copy(i32* %R, i32* %A) {

polly/trunk/test/GPGPU/size-cast.ll

	; CODE: cudaCheckReturn(cudaMemcpy(MemRef_arg2, dev_MemRef_arg2, (arg) * sizeof(double), cudaMemcpyDeviceToHost));			; CODE: cudaCheckReturn(cudaMemcpy(MemRef_arg2, dev_MemRef_arg2, (arg) * sizeof(double), cudaMemcpyDeviceToHost));
	; CODE-NEXT: }			; CODE-NEXT: }

	; CODE: # kernel0			; CODE: # kernel0
	; CODE-NEXT: for (int c0 = 0; c0 <= (arg - 32 * b0 - 1) / 1048576; c0 += 1)			; CODE-NEXT: for (int c0 = 0; c0 <= (arg - 32 * b0 - 1) / 1048576; c0 += 1)
	; CODE-NEXT: if (arg >= 32 * b0 + t0 + 1048576 * c0 + 1)			; CODE-NEXT: if (arg >= 32 * b0 + t0 + 1048576 * c0 + 1)
	; CODE-NEXT: Stmt_bb6(0, 32 * b0 + t0 + 1048576 * c0);			; CODE-NEXT: Stmt_bb6(0, 32 * b0 + t0 + 1048576 * c0);

	; IR: call i8* @polly_initContext()			; IR: call i8* @polly_initContextCUDA()
	; IR-NEXT: sext i32 %arg to i64			; IR-NEXT: sext i32 %arg to i64
	; IR-NEXT: mul i64			; IR-NEXT: mul i64
	; IR-NEXT: @polly_allocateMemoryForDevice			; IR-NEXT: @polly_allocateMemoryForDevice

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-unknown-linux-gnu"			target triple = "x86_64-unknown-linux-gnu"

	define void @hoge(i32 %arg, i32 %arg1, [1000 x double]* %arg2, double* %arg3) {			define void @hoge(i32 %arg, i32 %arg1, [1000 x double]* %arg2, double* %arg3) {

polly/trunk/tools/CMakeLists.txt

	if (CUDALIB_FOUND)			if (CUDALIB_FOUND OR OpenCL_FOUND)
	add_subdirectory(GPURuntime)			add_subdirectory(GPURuntime)
	endif (CUDALIB_FOUND)			endif (CUDALIB_FOUND OR OpenCL_FOUND)

	set(LLVM_COMMON_DEPENDS ${LLVM_COMMON_DEPENDS} PARENT_SCOPE)			set(LLVM_COMMON_DEPENDS ${LLVM_COMMON_DEPENDS} PARENT_SCOPE)

polly/trunk/tools/GPURuntime/GPUJIT.h

	* polly_copyFromDeviceToHost(HostData, DevData, MemSize);			* polly_copyFromDeviceToHost(HostData, DevData, MemSize);
	* polly_freeKernel(Kernel);			* polly_freeKernel(Kernel);
	* polly_freeDeviceMemory(DevArray);			* polly_freeDeviceMemory(DevArray);
	* polly_freeContext(Context);			* polly_freeContext(Context);
	* }			* }
	*			*
	*/			*/

				typedef enum PollyGPURuntimeT {
				RUNTIME_NONE,
				RUNTIME_CUDA,
				RUNTIME_CL
				} PollyGPURuntime;

	typedef struct PollyGPUContextT PollyGPUContext;			typedef struct PollyGPUContextT PollyGPUContext;
	typedef struct PollyGPUFunctionT PollyGPUFunction;			typedef struct PollyGPUFunctionT PollyGPUFunction;
	typedef struct PollyGPUDevicePtrT PollyGPUDevicePtr;			typedef struct PollyGPUDevicePtrT PollyGPUDevicePtr;

	PollyGPUContext *polly_initContext();			typedef struct OpenCLContextT OpenCLContext;
	PollyGPUFunction *polly_getKernel(const char *PTXBuffer,			typedef struct OpenCLKernelT OpenCLKernel;
				typedef struct OpenCLDevicePtrT OpenCLDevicePtr;

				typedef struct CUDAContextT CUDAContext;
				typedef struct CUDAKernelT CUDAKernel;
				typedef struct CUDADevicePtrT CUDADevicePtr;

				PollyGPUContext *polly_initContextCUDA();
				PollyGPUContext *polly_initContextCL();
				PollyGPUFunction *polly_getKernel(const char *BinaryBuffer,
	const char *KernelName);			const char *KernelName);
	void polly_freeKernel(PollyGPUFunction *Kernel);			void polly_freeKernel(PollyGPUFunction *Kernel);
	void polly_copyFromHostToDevice(void *HostData, PollyGPUDevicePtr *DevData,			void polly_copyFromHostToDevice(void *HostData, PollyGPUDevicePtr *DevData,
	long MemSize);			long MemSize);
	void polly_copyFromDeviceToHost(PollyGPUDevicePtr *DevData, void *HostData,			void polly_copyFromDeviceToHost(PollyGPUDevicePtr *DevData, void *HostData,
	long MemSize);			long MemSize);
	void polly_synchronizeDevice();			void polly_synchronizeDevice();
	void polly_launchKernel(PollyGPUFunction *Kernel, unsigned int GridDimX,			void polly_launchKernel(PollyGPUFunction *Kernel, unsigned int GridDimX,
	unsigned int GridDimY, unsigned int BlockSizeX,			unsigned int GridDimY, unsigned int BlockSizeX,
	unsigned int BlockSizeY, unsigned int BlockSizeZ,			unsigned int BlockSizeY, unsigned int BlockSizeZ,
	void **Parameters);			void **Parameters);
	void polly_freeDeviceMemory(PollyGPUDevicePtr *Allocation);			void polly_freeDeviceMemory(PollyGPUDevicePtr *Allocation);
	void polly_freeContext(PollyGPUContext *Context);			void polly_freeContext(PollyGPUContext *Context);
	#endif /* GPUJIT_H_ */			#endif /* GPUJIT_H_ */

polly/trunk/tools/GPURuntime/GPUJIT.c

/****************** GPUJIT.c - GPUJIT Execution Engine ********************/		/****************** GPUJIT.c - GPUJIT Execution Engine ********************/
/* */		/* */
/* The LLVM Compiler Infrastructure */		/* The LLVM Compiler Infrastructure */
/* */		/* */
/* This file is dual licensed under the MIT and the University of Illinois */		/* This file is dual licensed under the MIT and the University of Illinois */
/* Open Source License. See LICENSE.TXT for details. */		/* Open Source License. See LICENSE.TXT for details. */
/* */		/* */
/******************************************************************************/		/******************************************************************************/
/* */		/* */
/* This file implements GPUJIT, a ptx string execution engine for GPU. */		/* This file implements GPUJIT, a ptx string execution engine for GPU. */
/* */		/* */
/******************************************************************************/		/******************************************************************************/

#include "GPUJIT.h"		#include "GPUJIT.h"

		#ifdef HAS_LIBCUDART
#include <cuda.h>		#include <cuda.h>
#include <cuda_runtime.h>		#include <cuda_runtime.h>
		#endif /* HAS_LIBCUDART */

		#ifdef HAS_LIBOPENCL
		#ifdef __APPLE__
		#include <OpenCL/opencl.h>
		#else
		#include <CL/cl.h>
		#endif
		#endif /* HAS_LIBOPENCL */

#include <dlfcn.h>		#include <dlfcn.h>
#include <stdarg.h>		#include <stdarg.h>
#include <stdio.h>		#include <stdio.h>
#include <string.h>		#include <string.h>

static int DebugMode;		static int DebugMode;
static int CacheMode;		static int CacheMode;

		static PollyGPURuntime Runtime = RUNTIME_NONE;

static void debug_print(const char *format, ...) {		static void debug_print(const char *format, ...) {
if (!DebugMode)		if (!DebugMode)
return;		return;

va_list args;		va_list args;
va_start(args, format);		va_start(args, format);
vfprintf(stderr, format, args);		vfprintf(stderr, format, args);
va_end(args);		va_end(args);
}		}
#define dump_function() debug_print("-> %s\n", __func__)		#define dump_function() debug_print("-> %s\n", __func__)

/* Define Polly's GPGPU data types. */		#define KERNEL_CACHE_SIZE 10

		static void err_runtime() {
		fprintf(stderr, "Runtime not correctly initialized.\n");
		exit(-1);
		}

struct PollyGPUContextT {		struct PollyGPUContextT {
CUcontext Cuda;		void *Context;
};		};

struct PollyGPUFunctionT {		struct PollyGPUFunctionT {
		void *Kernel;
		};

		struct PollyGPUDevicePtrT {
		void *DevicePtr;
		};

		/******************************************************************************/
		/* OpenCL */
		/******************************************************************************/
		#ifdef HAS_LIBOPENCL

		struct OpenCLContextT {
		cl_context Context;
		cl_command_queue CommandQueue;
		};

		struct OpenCLKernelT {
		cl_kernel Kernel;
		cl_program Program;
		const char *BinaryString;
		};

		struct OpenCLDevicePtrT {
		cl_mem MemObj;
		};

		/* Dynamic library handles for the OpenCL runtime library. */
		static void *HandleOpenCL;

		/* Type-defines of function pointer to OpenCL Runtime API. */
		typedef cl_int clGetPlatformIDsFcnTy(cl_uint NumEntries,
		cl_platform_id *Platforms,
		cl_uint *NumPlatforms);
		static clGetPlatformIDsFcnTy *clGetPlatformIDsFcnPtr;

		typedef cl_int clGetDeviceIDsFcnTy(cl_platform_id Platform,
		cl_device_type DeviceType,
		cl_uint NumEntries, cl_device_id *Devices,
		cl_uint *NumDevices);
		static clGetDeviceIDsFcnTy *clGetDeviceIDsFcnPtr;

		typedef cl_int clGetDeviceInfoFcnTy(cl_device_id Device,
		cl_device_info ParamName,
		size_t ParamValueSize, void *ParamValue,
		size_t *ParamValueSizeRet);
		static clGetDeviceInfoFcnTy *clGetDeviceInfoFcnPtr;

		typedef cl_int clGetKernelInfoFcnTy(cl_kernel Kernel, cl_kernel_info ParamName,
		size_t ParamValueSize, void *ParamValue,
		size_t *ParamValueSizeRet);
		static clGetKernelInfoFcnTy *clGetKernelInfoFcnPtr;

		typedef cl_context clCreateContextFcnTy(
		const cl_context_properties *Properties, cl_uint NumDevices,
		const cl_device_id *Devices,
		void CL_CALLBACK *pfn_notify(const char *Errinfo, const void *PrivateInfo,
		size_t CB, void *UserData),
		void *UserData, cl_int *ErrcodeRet);
		static clCreateContextFcnTy *clCreateContextFcnPtr;

		typedef cl_command_queue
		clCreateCommandQueueFcnTy(cl_context Context, cl_device_id Device,
		cl_command_queue_properties Properties,
		cl_int *ErrcodeRet);
		static clCreateCommandQueueFcnTy *clCreateCommandQueueFcnPtr;

		typedef cl_mem clCreateBufferFcnTy(cl_context Context, cl_mem_flags Flags,
		size_t Size, void *HostPtr,
		cl_int *ErrcodeRet);
		static clCreateBufferFcnTy *clCreateBufferFcnPtr;

		typedef cl_int
		clEnqueueWriteBufferFcnTy(cl_command_queue CommandQueue, cl_mem Buffer,
		cl_bool BlockingWrite, size_t Offset, size_t Size,
		const void *Ptr, cl_uint NumEventsInWaitList,
		const cl_event *EventWaitList, cl_event *Event);
		static clEnqueueWriteBufferFcnTy *clEnqueueWriteBufferFcnPtr;

		typedef cl_program clCreateProgramWithBinaryFcnTy(
		cl_context Context, cl_uint NumDevices, const cl_device_id *DeviceList,
		const size_t *Lengths, const unsigned char **Binaries, cl_int *BinaryStatus,
		cl_int *ErrcodeRet);
		static clCreateProgramWithBinaryFcnTy *clCreateProgramWithBinaryFcnPtr;

		typedef cl_int clBuildProgramFcnTy(
		cl_program Program, cl_uint NumDevices, const cl_device_id *DeviceList,
		const char *Options,
		void(CL_CALLBACK *pfn_notify)(cl_program Program, void *UserData),
		void *UserData);
		static clBuildProgramFcnTy *clBuildProgramFcnPtr;

		typedef cl_kernel clCreateKernelFcnTy(cl_program Program,
		const char *KernelName,
		cl_int *ErrcodeRet);
		static clCreateKernelFcnTy *clCreateKernelFcnPtr;

		typedef cl_int clSetKernelArgFcnTy(cl_kernel Kernel, cl_uint ArgIndex,
		size_t ArgSize, const void *ArgValue);
		static clSetKernelArgFcnTy *clSetKernelArgFcnPtr;

		typedef cl_int clEnqueueNDRangeKernelFcnTy(
		cl_command_queue CommandQueue, cl_kernel Kernel, cl_uint WorkDim,
		const size_t *GlobalWorkOffset, const size_t *GlobalWorkSize,
		const size_t *LocalWorkSize, cl_uint NumEventsInWaitList,
		const cl_event *EventWaitList, cl_event *Event);
		static clEnqueueNDRangeKernelFcnTy *clEnqueueNDRangeKernelFcnPtr;

		typedef cl_int clEnqueueReadBufferFcnTy(cl_command_queue CommandQueue,
		cl_mem Buffer, cl_bool BlockingRead,
		size_t Offset, size_t Size, void *Ptr,
		cl_uint NumEventsInWaitList,
		const cl_event *EventWaitList,
		cl_event *Event);
		static clEnqueueReadBufferFcnTy *clEnqueueReadBufferFcnPtr;

		typedef cl_int clFlushFcnTy(cl_command_queue CommandQueue);
		static clFlushFcnTy *clFlushFcnPtr;

		typedef cl_int clFinishFcnTy(cl_command_queue CommandQueue);
		static clFinishFcnTy *clFinishFcnPtr;

		typedef cl_int clReleaseKernelFcnTy(cl_kernel Kernel);
		static clReleaseKernelFcnTy *clReleaseKernelFcnPtr;

		typedef cl_int clReleaseProgramFcnTy(cl_program Program);
		static clReleaseProgramFcnTy *clReleaseProgramFcnPtr;

		typedef cl_int clReleaseMemObjectFcnTy(cl_mem Memobject);
		static clReleaseMemObjectFcnTy *clReleaseMemObjectFcnPtr;

		typedef cl_int clReleaseCommandQueueFcnTy(cl_command_queue CommandQueue);
		static clReleaseCommandQueueFcnTy *clReleaseCommandQueueFcnPtr;

		typedef cl_int clReleaseContextFcnTy(cl_context Context);
		static clReleaseContextFcnTy *clReleaseContextFcnPtr;

		static void *getAPIHandleCL(void *Handle, const char *FuncName) {
		char *Err;
		void *FuncPtr;
		dlerror();
		FuncPtr = dlsym(Handle, FuncName);
		if ((Err = dlerror()) != 0) {
		fprintf(stderr, "Load OpenCL Runtime API failed: %s. \n", Err);
		return 0;
		}
		return FuncPtr;
		}

		static int initialDeviceAPILibrariesCL() {
		HandleOpenCL = dlopen("libOpenCL.so", RTLD_LAZY);
		if (!HandleOpenCL) {
		fprintf(stderr, "Cannot open library: %s. \n", dlerror());
		return 0;
		}
		return 1;
		}

		static int initialDeviceAPIsCL() {
		if (initialDeviceAPILibrariesCL() == 0)
		return 0;

		/* Get function pointer to OpenCL Runtime API.
		*
		* Note that compilers conforming to the ISO C standard are required to
		* generate a warning if a conversion from a void * pointer to a function
		* pointer is attempted as in the following statements. The warning
		* of this kind of cast may not be emitted by clang and new versions of gcc
		* as it is valid on POSIX 2008.
		*/
		clGetPlatformIDsFcnPtr =
		(clGetPlatformIDsFcnTy *)getAPIHandleCL(HandleOpenCL, "clGetPlatformIDs");

		clGetDeviceIDsFcnPtr =
		(clGetDeviceIDsFcnTy *)getAPIHandleCL(HandleOpenCL, "clGetDeviceIDs");

		clGetDeviceInfoFcnPtr =
		(clGetDeviceInfoFcnTy *)getAPIHandleCL(HandleOpenCL, "clGetDeviceInfo");

		clGetKernelInfoFcnPtr =
		(clGetKernelInfoFcnTy *)getAPIHandleCL(HandleOpenCL, "clGetKernelInfo");

		clCreateContextFcnPtr =
		(clCreateContextFcnTy *)getAPIHandleCL(HandleOpenCL, "clCreateContext");

		clCreateCommandQueueFcnPtr = (clCreateCommandQueueFcnTy *)getAPIHandleCL(
		HandleOpenCL, "clCreateCommandQueue");

		clCreateBufferFcnPtr =
		(clCreateBufferFcnTy *)getAPIHandleCL(HandleOpenCL, "clCreateBuffer");

		clEnqueueWriteBufferFcnPtr = (clEnqueueWriteBufferFcnTy *)getAPIHandleCL(
		HandleOpenCL, "clEnqueueWriteBuffer");

		clCreateProgramWithBinaryFcnPtr =
		(clCreateProgramWithBinaryFcnTy *)getAPIHandleCL(
		HandleOpenCL, "clCreateProgramWithBinary");

		clBuildProgramFcnPtr =
		(clBuildProgramFcnTy *)getAPIHandleCL(HandleOpenCL, "clBuildProgram");

		clCreateKernelFcnPtr =
		(clCreateKernelFcnTy *)getAPIHandleCL(HandleOpenCL, "clCreateKernel");

		clSetKernelArgFcnPtr =
		(clSetKernelArgFcnTy *)getAPIHandleCL(HandleOpenCL, "clSetKernelArg");

		clEnqueueNDRangeKernelFcnPtr = (clEnqueueNDRangeKernelFcnTy *)getAPIHandleCL(
		HandleOpenCL, "clEnqueueNDRangeKernel");

		clEnqueueReadBufferFcnPtr = (clEnqueueReadBufferFcnTy *)getAPIHandleCL(
		HandleOpenCL, "clEnqueueReadBuffer");

		clFlushFcnPtr = (clFlushFcnTy *)getAPIHandleCL(HandleOpenCL, "clFlush");

		clFinishFcnPtr = (clFinishFcnTy *)getAPIHandleCL(HandleOpenCL, "clFinish");

		clReleaseKernelFcnPtr =
		(clReleaseKernelFcnTy *)getAPIHandleCL(HandleOpenCL, "clReleaseKernel");

		clReleaseProgramFcnPtr =
		(clReleaseProgramFcnTy *)getAPIHandleCL(HandleOpenCL, "clReleaseProgram");

		clReleaseMemObjectFcnPtr = (clReleaseMemObjectFcnTy *)getAPIHandleCL(
		HandleOpenCL, "clReleaseMemObject");

		clReleaseCommandQueueFcnPtr = (clReleaseCommandQueueFcnTy *)getAPIHandleCL(
		HandleOpenCL, "clReleaseCommandQueue");

		clReleaseContextFcnPtr =
		(clReleaseContextFcnTy *)getAPIHandleCL(HandleOpenCL, "clReleaseContext");

		return 1;
		}
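
The comment in initialDeviceAPIsCL notes that ISO C does not define the conversion from a `void *` to a function pointer, even though POSIX requires it to work for `dlsym`. The following minimal sketch shows one way to perform that conversion without a direct cast, by copying the pointer bytes. It is not part of the patch: `AddOneFcnTy`, `addOne`, `fakeLookup`, and `toAddOne` are hypothetical names, `fakeLookup` merely stands in for `dlsym`, and the sketch assumes a platform where object and function pointers have the same size and representation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical function type and target, standing in for an API entry. */
typedef int AddOneFcnTy(int);
static int addOne(int X) { return X + 1; }

/* Stand-in for dlsym(): hands back the function's address as a void *.
 * The bytes are copied so that no pointer cast is needed. */
static void *fakeLookup(void) {
  AddOneFcnTy *Fn = addOne;
  void *Sym;
  assert(sizeof(Sym) == sizeof(Fn));
  memcpy(&Sym, &Fn, sizeof(Sym));
  return Sym;
}

/* Turn a looked-up symbol into a typed function pointer without a direct
 * void * to function-pointer cast. */
static AddOneFcnTy *toAddOne(void *Sym) {
  AddOneFcnTy *Fn;
  assert(sizeof(Fn) == sizeof(Sym));
  memcpy(&Fn, &Sym, sizeof(Fn));
  return Fn;
}
```

The patch itself uses plain casts, which clang and recent gcc accept silently; the byte copy is only an alternative for builds with stricter warning settings.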

		/* Context and Device. */
		static PollyGPUContext *GlobalContext = NULL;
		static cl_device_id GlobalDeviceID = NULL;

		/* Forward declaration: print OpenCL error codes as human-readable strings. */
		static void printOpenCLError(int Error);

		static void checkOpenCLError(int Ret, const char *format, ...) {
		if (Ret == CL_SUCCESS)
		return;

		printOpenCLError(Ret);
		va_list args;
		va_start(args, format);
		vfprintf(stderr, format, args);
		va_end(args);
		exit(-1);
		}

		static PollyGPUContext *initContextCL() {
		dump_function();

		PollyGPUContext *Context;

		cl_platform_id PlatformID = NULL;
		cl_device_id DeviceID = NULL;
		cl_uint NumDevicesRet;
		cl_int Ret;

		char DeviceRevision[256];
		char DeviceName[256];
		size_t DeviceRevisionRetSize, DeviceNameRetSize;

		static __thread PollyGPUContext *CurrentContext = NULL;

		if (CurrentContext)
		return CurrentContext;

		/* Get API handles. */
		if (initialDeviceAPIsCL() == 0) {
		fprintf(stderr, "Getting the \"handle\" for the OpenCL Runtime failed.\n");
		exit(-1);
		}

		/* Get number of devices that support OpenCL. */
		static const int NumberOfPlatforms = 1;
		Ret = clGetPlatformIDsFcnPtr(NumberOfPlatforms, &PlatformID, NULL);
		checkOpenCLError(Ret, "Failed to get platform IDs.\n");
		// TODO: Extend to CL_DEVICE_TYPE_ALL?
		static const int NumberOfDevices = 1;
		Ret = clGetDeviceIDsFcnPtr(PlatformID, CL_DEVICE_TYPE_GPU, NumberOfDevices,
		&DeviceID, &NumDevicesRet);
		checkOpenCLError(Ret, "Failed to get device IDs.\n");

		GlobalDeviceID = DeviceID;
		if (NumDevicesRet == 0) {
		fprintf(stderr, "There is no device supporting OpenCL.\n");
		exit(-1);
		}

		/* Get device revision. */
		Ret =
		clGetDeviceInfoFcnPtr(DeviceID, CL_DEVICE_VERSION, sizeof(DeviceRevision),
		DeviceRevision, &DeviceRevisionRetSize);
		checkOpenCLError(Ret, "Failed to fetch device revision.\n");

		/* Get device name. */
		Ret = clGetDeviceInfoFcnPtr(DeviceID, CL_DEVICE_NAME, sizeof(DeviceName),
		DeviceName, &DeviceNameRetSize);
		checkOpenCLError(Ret, "Failed to fetch device name.\n");

		debug_print("> Running on GPU device %p : %s.\n", (void *)DeviceID, DeviceName);

		/* Create context on the device. */
		Context = (PollyGPUContext *)malloc(sizeof(PollyGPUContext));
		if (Context == 0) {
		fprintf(stderr, "Allocate memory for Polly GPU context failed.\n");
		exit(-1);
		}
		Context->Context = (OpenCLContext *)malloc(sizeof(OpenCLContext));
		if (Context->Context == 0) {
		fprintf(stderr, "Allocate memory for Polly OpenCL context failed.\n");
		exit(-1);
		}
		((OpenCLContext *)Context->Context)->Context =
		clCreateContextFcnPtr(NULL, NumDevicesRet, &DeviceID, NULL, NULL, &Ret);
		checkOpenCLError(Ret, "Failed to create context.\n");

		static const int ExtraProperties = 0;
		((OpenCLContext *)Context->Context)->CommandQueue =
		clCreateCommandQueueFcnPtr(((OpenCLContext *)Context->Context)->Context,
		DeviceID, ExtraProperties, &Ret);
		checkOpenCLError(Ret, "Failed to create command queue.\n");

		if (CacheMode)
		CurrentContext = Context;

		GlobalContext = Context;
		return Context;
		}

		static void freeKernelCL(PollyGPUFunction *Kernel) {
		dump_function();

		if (CacheMode)
		return;

		if (!GlobalContext) {
		fprintf(stderr, "GPGPU-code generation not correctly initialized.\n");
		exit(-1);
		}

		cl_int Ret;
		Ret = clFlushFcnPtr(((OpenCLContext *)GlobalContext->Context)->CommandQueue);
		checkOpenCLError(Ret, "Failed to flush command queue.\n");
		Ret = clFinishFcnPtr(((OpenCLContext *)GlobalContext->Context)->CommandQueue);
		checkOpenCLError(Ret, "Failed to finish command queue.\n");

		if (((OpenCLKernel *)Kernel->Kernel)->Kernel) {
		cl_int Ret =
		clReleaseKernelFcnPtr(((OpenCLKernel *)Kernel->Kernel)->Kernel);
		checkOpenCLError(Ret, "Failed to release kernel.\n");
		}

		if (((OpenCLKernel *)Kernel->Kernel)->Program) {
		cl_int Ret =
		clReleaseProgramFcnPtr(((OpenCLKernel *)Kernel->Kernel)->Program);
		checkOpenCLError(Ret, "Failed to release program.\n");
		}

		if (Kernel->Kernel)
		free((OpenCLKernel *)Kernel->Kernel);

		if (Kernel)
		free(Kernel);
		}

		static PollyGPUFunction *getKernelCL(const char *BinaryBuffer,
		const char *KernelName) {
		dump_function();

		if (!GlobalContext) {
		fprintf(stderr, "GPGPU-code generation not correctly initialized.\n");
		exit(-1);
		}

		static __thread PollyGPUFunction *KernelCache[KERNEL_CACHE_SIZE];
		static __thread int NextCacheItem = 0;

		for (long i = 0; i < KERNEL_CACHE_SIZE; i++) {
		// We exploit here the property that all Polly-ACC kernels are allocated
		// as global constants, hence a pointer comparison is sufficient to
		// determine equality.
		if (KernelCache[i] &&
		((OpenCLKernel *)KernelCache[i]->Kernel)->BinaryString ==
		BinaryBuffer) {
		debug_print(" -> using cached kernel\n");
		return KernelCache[i];
		}
		}

		PollyGPUFunction *Function = malloc(sizeof(PollyGPUFunction));
		if (Function == 0) {
		fprintf(stderr, "Allocate memory for Polly GPU function failed.\n");
		exit(-1);
		}
		Function->Kernel = (OpenCLKernel *)malloc(sizeof(OpenCLKernel));
		if (Function->Kernel == 0) {
		fprintf(stderr, "Allocate memory for Polly OpenCL kernel failed.\n");
		exit(-1);
		}

		if (!GlobalDeviceID) {
		fprintf(stderr, "GPGPU-code generation not initialized correctly.\n");
		exit(-1);
		}

		cl_int Ret;
		size_t BinarySize = strlen(BinaryBuffer);
		((OpenCLKernel *)Function->Kernel)->Program = clCreateProgramWithBinaryFcnPtr(
		((OpenCLContext *)GlobalContext->Context)->Context, 1, &GlobalDeviceID,
		(const size_t *)&BinarySize, (const unsigned char **)&BinaryBuffer, NULL,
		&Ret);
		checkOpenCLError(Ret, "Failed to create program from binary.\n");

		Ret = clBuildProgramFcnPtr(((OpenCLKernel *)Function->Kernel)->Program, 1,
		&GlobalDeviceID, NULL, NULL, NULL);
		checkOpenCLError(Ret, "Failed to build program.\n");

		((OpenCLKernel *)Function->Kernel)->Kernel = clCreateKernelFcnPtr(
		((OpenCLKernel *)Function->Kernel)->Program, KernelName, &Ret);
		checkOpenCLError(Ret, "Failed to create kernel.\n");

		((OpenCLKernel *)Function->Kernel)->BinaryString = BinaryBuffer;

		if (CacheMode) {
		if (KernelCache[NextCacheItem])
		freeKernelCL(KernelCache[NextCacheItem]);

		KernelCache[NextCacheItem] = Function;

		NextCacheItem = (NextCacheItem + 1) % KERNEL_CACHE_SIZE;
		}

		return Function;
		}
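
The caching scheme used by getKernelCL can be looked at in isolation. The following is a minimal sketch, not the patch's code: `CacheEntry`, `cacheLookup`, `cacheInsert`, and `CACHE_SIZE` are hypothetical names, and an `int` stands in for the compiled kernel. It shows lookup by pointer identity and round-robin replacement, relying on the property stated in the comment above that each kernel binary lives at a fixed global address.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SIZE 4 /* Hypothetical; the patch uses KERNEL_CACHE_SIZE. */

/* Hypothetical cache entry: the key is the address of the kernel binary. */
typedef struct {
  const char *Binary; /* Key, compared by address, never by content. */
  int Value;          /* Stand-in for the compiled kernel. */
} CacheEntry;

static CacheEntry Cache[CACHE_SIZE];
static int NextCacheItem = 0;

/* Return the cached entry for Binary, or NULL on a miss. */
static CacheEntry *cacheLookup(const char *Binary) {
  for (int i = 0; i < CACHE_SIZE; i++)
    if (Cache[i].Binary && Cache[i].Binary == Binary)
      return &Cache[i];
  return NULL;
}

/* Store an entry, overwriting the oldest slot in round-robin order. */
static CacheEntry *cacheInsert(const char *Binary, int Value) {
  CacheEntry *Slot = &Cache[NextCacheItem];
  Slot->Binary = Binary;
  Slot->Value = Value;
  NextCacheItem = (NextCacheItem + 1) % CACHE_SIZE;
  return Slot;
}
```

Two buffers with identical contents but different addresses are treated as different kernels, which is why the scheme only works when each binary is a unique global constant.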

		static void copyFromHostToDeviceCL(void *HostData, PollyGPUDevicePtr *DevData,
		long MemSize) {
		dump_function();

		if (!GlobalContext) {
		fprintf(stderr, "GPGPU-code generation not correctly initialized.\n");
		exit(-1);
		}

		cl_int Ret;
		Ret = clEnqueueWriteBufferFcnPtr(
		((OpenCLContext *)GlobalContext->Context)->CommandQueue,
		((OpenCLDevicePtr *)DevData->DevicePtr)->MemObj, CL_TRUE, 0, MemSize,
		HostData, 0, NULL, NULL);
		checkOpenCLError(Ret, "Copying data from host memory to device failed.\n");
		}

		static void copyFromDeviceToHostCL(PollyGPUDevicePtr *DevData, void *HostData,
		long MemSize) {
		dump_function();

		if (!GlobalContext) {
		fprintf(stderr, "GPGPU-code generation not correctly initialized.\n");
		exit(-1);
		}

		cl_int Ret;
		Ret = clEnqueueReadBufferFcnPtr(
		((OpenCLContext *)GlobalContext->Context)->CommandQueue,
		((OpenCLDevicePtr *)DevData->DevicePtr)->MemObj, CL_TRUE, 0, MemSize,
		HostData, 0, NULL, NULL);
		checkOpenCLError(Ret, "Copying results from device to host memory failed.\n");
		}

		static void launchKernelCL(PollyGPUFunction *Kernel, unsigned int GridDimX,
		unsigned int GridDimY, unsigned int BlockDimX,
		unsigned int BlockDimY, unsigned int BlockDimZ,
		void **Parameters) {
		dump_function();

		cl_int Ret;
		cl_uint NumArgs;

		if (!GlobalContext) {
		fprintf(stderr, "GPGPU-code generation not correctly initialized.\n");
		exit(-1);
		}

		OpenCLKernel *CLKernel = (OpenCLKernel *)Kernel->Kernel;
		Ret = clGetKernelInfoFcnPtr(CLKernel->Kernel, CL_KERNEL_NUM_ARGS,
		sizeof(cl_uint), &NumArgs, NULL);
		checkOpenCLError(Ret, "Failed to get number of kernel arguments.\n");

		// TODO: Pass the size of the kernel arguments in to launchKernelCL, along
		// with the arguments themselves. This is a dirty workaround that can be
		// broken.
		for (cl_uint i = 0; i < NumArgs; i++) {
		Ret = clSetKernelArgFcnPtr(CLKernel->Kernel, i, 8, (void *)Parameters[i]);
		if (Ret == CL_INVALID_ARG_SIZE) {
		Ret = clSetKernelArgFcnPtr(CLKernel->Kernel, i, 4, (void *)Parameters[i]);
		if (Ret == CL_INVALID_ARG_SIZE) {
		Ret =
		clSetKernelArgFcnPtr(CLKernel->Kernel, i, 2, (void *)Parameters[i]);
		if (Ret == CL_INVALID_ARG_SIZE) {
		Ret = clSetKernelArgFcnPtr(CLKernel->Kernel, i, 1,
		(void *)Parameters[i]);
		checkOpenCLError(Ret, "Failed to set Kernel argument %d.\n", i);
		}
		}
		}
		if (Ret != CL_SUCCESS && Ret != CL_INVALID_ARG_SIZE) {
		fprintf(stderr, "Failed to set Kernel argument.\n");
		printOpenCLError(Ret);
		exit(-1);
		}
		}

		unsigned int GridDimZ = 1;
		size_t GlobalWorkSize[3] = {BlockDimX * GridDimX, BlockDimY * GridDimY,
		BlockDimZ * GridDimZ};
		size_t LocalWorkSize[3] = {BlockDimX, BlockDimY, BlockDimZ};

		static const int WorkDim = 3;
		OpenCLContext *CLContext = (OpenCLContext *)GlobalContext->Context;
		Ret = clEnqueueNDRangeKernelFcnPtr(CLContext->CommandQueue, CLKernel->Kernel,
		WorkDim, NULL, GlobalWorkSize,
		LocalWorkSize, 0, NULL, NULL);
		checkOpenCLError(Ret, "Launching OpenCL kernel failed.\n");
		}
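
The TODO in launchKernelCL describes the argument-size probing as a workaround that can break. The following minimal sketch isolates that probing logic so its limits are easier to see. It is not the patch's code: `SetArgFn`, `probeArgSize`, `mockSetArg`, and the `ARG_*` codes are hypothetical stand-ins for `clSetKernelArg`, `CL_SUCCESS`, and `CL_INVALID_ARG_SIZE`, and the mock setter only simulates a runtime that validates argument sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes standing in for CL_SUCCESS and
 * CL_INVALID_ARG_SIZE. */
enum { ARG_OK = 0, ARG_BAD_SIZE = -51 };

/* Hypothetical setter type: mirrors the shape of clSetKernelArg without
 * the kernel handle. */
typedef int SetArgFn(unsigned Index, size_t Size, const void *Value);

/* Try the sizes 8, 4, 2, 1 in turn and return the first size the setter
 * accepts, or 0 if none is accepted. */
static size_t probeArgSize(SetArgFn *SetArg, unsigned Index,
                           const void *Value) {
  static const size_t Candidates[] = {8, 4, 2, 1};
  for (size_t i = 0; i < sizeof(Candidates) / sizeof(Candidates[0]); i++) {
    int Ret = SetArg(Index, Candidates[i], Value);
    if (Ret == ARG_OK)
      return Candidates[i];
    if (Ret != ARG_BAD_SIZE)
      return 0; /* A different error: stop probing. */
  }
  return 0;
}

/* Mock setter: argument 0 is 8 bytes, argument 1 is 4 bytes, and every
 * other argument is 12 bytes (a size the probe never tries). */
static int mockSetArg(unsigned Index, size_t Size, const void *Value) {
  (void)Value;
  size_t Expected = Index == 0 ? 8 : Index == 1 ? 4 : 12;
  return Size == Expected ? ARG_OK : ARG_BAD_SIZE;
}
```

With the mock, arguments 0 and 1 are resolved to 8 and 4 bytes, while the 12-byte argument can never be set. Passing the sizes explicitly, as the TODO suggests, would remove the need for guessing.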

		static void freeDeviceMemoryCL(PollyGPUDevicePtr *Allocation) {
		dump_function();

		OpenCLDevicePtr *DevPtr = (OpenCLDevicePtr *)Allocation->DevicePtr;
		cl_int Ret = clReleaseMemObjectFcnPtr((cl_mem)DevPtr->MemObj);
		checkOpenCLError(Ret, "Failed to free device memory.\n");

		free(DevPtr);
		free(Allocation);
		}

		static PollyGPUDevicePtr *allocateMemoryForDeviceCL(long MemSize) {
		dump_function();

		if (!GlobalContext) {
		fprintf(stderr, "GPGPU-code generation not correctly initialized.\n");
		exit(-1);
		}

		PollyGPUDevicePtr *DevData = malloc(sizeof(PollyGPUDevicePtr));
		if (DevData == 0) {
		fprintf(stderr, "Allocate memory for GPU device memory pointer failed.\n");
		exit(-1);
		}
		DevData->DevicePtr = (OpenCLDevicePtr *)malloc(sizeof(OpenCLDevicePtr));
		if (DevData->DevicePtr == 0) {
		fprintf(stderr, "Allocate memory for GPU device memory pointer failed.\n");
		exit(-1);
		}

		cl_int Ret;
		((OpenCLDevicePtr *)DevData->DevicePtr)->MemObj =
		clCreateBufferFcnPtr(((OpenCLContext *)GlobalContext->Context)->Context,
		CL_MEM_READ_WRITE, MemSize, NULL, &Ret);
		checkOpenCLError(Ret,
		"Allocate memory for GPU device memory pointer failed.\n");

		return DevData;
		}

		static void *getDevicePtrCL(PollyGPUDevicePtr *Allocation) {
		dump_function();

		OpenCLDevicePtr *DevPtr = (OpenCLDevicePtr *)Allocation->DevicePtr;
		return (void *)DevPtr->MemObj;
		}

		static void synchronizeDeviceCL() {
		dump_function();

		if (!GlobalContext) {
		fprintf(stderr, "GPGPU-code generation not correctly initialized.\n");
		exit(-1);
		}

		if (clFinishFcnPtr(((OpenCLContext *)GlobalContext->Context)->CommandQueue) !=
		CL_SUCCESS) {
		fprintf(stderr, "Synchronizing device and host memory failed.\n");
		exit(-1);
		}
		}

		static void freeContextCL(PollyGPUContext *Context) {
		dump_function();

		cl_int Ret;

		GlobalContext = NULL;

		OpenCLContext *Ctx = (OpenCLContext *)Context->Context;
		if (Ctx->CommandQueue) {
		Ret = clReleaseCommandQueueFcnPtr(Ctx->CommandQueue);
		checkOpenCLError(Ret, "Could not release command queue.\n");
		}

		if (Ctx->Context) {
		Ret = clReleaseContextFcnPtr(Ctx->Context);
		checkOpenCLError(Ret, "Could not release context.\n");
		}

		free(Ctx);
		free(Context);
		}

		static void printOpenCLError(int Error) {

		switch (Error) {
		case CL_SUCCESS:
		// Success, don't print an error.
		break;

		// JIT/Runtime errors.
		case CL_DEVICE_NOT_FOUND:
		fprintf(stderr, "Device not found.\n");
		break;
		case CL_DEVICE_NOT_AVAILABLE:
		fprintf(stderr, "Device not available.\n");
		break;
		case CL_COMPILER_NOT_AVAILABLE:
		fprintf(stderr, "Compiler not available.\n");
		break;
		case CL_MEM_OBJECT_ALLOCATION_FAILURE:
		fprintf(stderr, "Mem object allocation failure.\n");
		break;
		case CL_OUT_OF_RESOURCES:
		fprintf(stderr, "Out of resources.\n");
		break;
		case CL_OUT_OF_HOST_MEMORY:
		fprintf(stderr, "Out of host memory.\n");
		break;
		case CL_PROFILING_INFO_NOT_AVAILABLE:
		fprintf(stderr, "Profiling info not available.\n");
		break;
		case CL_MEM_COPY_OVERLAP:
		fprintf(stderr, "Mem copy overlap.\n");
		break;
		case CL_IMAGE_FORMAT_MISMATCH:
		fprintf(stderr, "Image format mismatch.\n");
		break;
		case CL_IMAGE_FORMAT_NOT_SUPPORTED:
		fprintf(stderr, "Image format not supported.\n");
		break;
		case CL_BUILD_PROGRAM_FAILURE:
		fprintf(stderr, "Build program failure.\n");
		break;
		case CL_MAP_FAILURE:
		fprintf(stderr, "Map failure.\n");
		break;
		case CL_MISALIGNED_SUB_BUFFER_OFFSET:
		fprintf(stderr, "Misaligned sub buffer offset.\n");
		break;
		case CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST:
		fprintf(stderr, "Exec status error for events in wait list.\n");
		break;
		case CL_COMPILE_PROGRAM_FAILURE:
		fprintf(stderr, "Compile program failure.\n");
		break;
		case CL_LINKER_NOT_AVAILABLE:
		fprintf(stderr, "Linker not available.\n");
		break;
		case CL_LINK_PROGRAM_FAILURE:
		fprintf(stderr, "Link program failure.\n");
		break;
		case CL_DEVICE_PARTITION_FAILED:
		fprintf(stderr, "Device partition failed.\n");
		break;
		case CL_KERNEL_ARG_INFO_NOT_AVAILABLE:
		fprintf(stderr, "Kernel arg info not available.\n");
		break;

		// Compiler errors.
		case CL_INVALID_VALUE:
		fprintf(stderr, "Invalid value.\n");
		break;
		case CL_INVALID_DEVICE_TYPE:
		fprintf(stderr, "Invalid device type.\n");
		break;
		case CL_INVALID_PLATFORM:
		fprintf(stderr, "Invalid platform.\n");
		break;
		case CL_INVALID_DEVICE:
		fprintf(stderr, "Invalid device.\n");
		break;
		case CL_INVALID_CONTEXT:
		fprintf(stderr, "Invalid context.\n");
		break;
		case CL_INVALID_QUEUE_PROPERTIES:
		fprintf(stderr, "Invalid queue properties.\n");
		break;
		case CL_INVALID_COMMAND_QUEUE:
		fprintf(stderr, "Invalid command queue.\n");
		break;
		case CL_INVALID_HOST_PTR:
		fprintf(stderr, "Invalid host pointer.\n");
		break;
		case CL_INVALID_MEM_OBJECT:
		fprintf(stderr, "Invalid memory object.\n");
		break;
		case CL_INVALID_IMAGE_FORMAT_DESCRIPTOR:
		fprintf(stderr, "Invalid image format descriptor.\n");
		break;
		case CL_INVALID_IMAGE_SIZE:
		fprintf(stderr, "Invalid image size.\n");
		break;
		case CL_INVALID_SAMPLER:
		fprintf(stderr, "Invalid sampler.\n");
		break;
		case CL_INVALID_BINARY:
		fprintf(stderr, "Invalid binary.\n");
		break;
		case CL_INVALID_BUILD_OPTIONS:
		fprintf(stderr, "Invalid build options.\n");
		break;
		case CL_INVALID_PROGRAM:
		fprintf(stderr, "Invalid program.\n");
		break;
		case CL_INVALID_PROGRAM_EXECUTABLE:
		fprintf(stderr, "Invalid program executable.\n");
		break;
		case CL_INVALID_KERNEL_NAME:
		fprintf(stderr, "Invalid kernel name.\n");
		break;
		case CL_INVALID_KERNEL_DEFINITION:
		fprintf(stderr, "Invalid kernel definition.\n");
		break;
		case CL_INVALID_KERNEL:
		fprintf(stderr, "Invalid kernel.\n");
		break;
		case CL_INVALID_ARG_INDEX:
		fprintf(stderr, "Invalid arg index.\n");
		break;
		case CL_INVALID_ARG_VALUE:
		fprintf(stderr, "Invalid arg value.\n");
		break;
		case CL_INVALID_ARG_SIZE:
		fprintf(stderr, "Invalid arg size.\n");
		break;
		case CL_INVALID_KERNEL_ARGS:
		fprintf(stderr, "Invalid kernel args.\n");
		break;
		case CL_INVALID_WORK_DIMENSION:
		fprintf(stderr, "Invalid work dimension.\n");
		break;
		case CL_INVALID_WORK_GROUP_SIZE:
		fprintf(stderr, "Invalid work group size.\n");
		break;
		case CL_INVALID_WORK_ITEM_SIZE:
		fprintf(stderr, "Invalid work item size.\n");
		break;
		case CL_INVALID_GLOBAL_OFFSET:
		fprintf(stderr, "Invalid global offset.\n");
		break;
		case CL_INVALID_EVENT_WAIT_LIST:
		fprintf(stderr, "Invalid event wait list.\n");
		break;
		case CL_INVALID_EVENT:
		fprintf(stderr, "Invalid event.\n");
		break;
		case CL_INVALID_OPERATION:
		fprintf(stderr, "Invalid operation.\n");
		break;
		case CL_INVALID_GL_OBJECT:
		fprintf(stderr, "Invalid GL object.\n");
		break;
		case CL_INVALID_BUFFER_SIZE:
		fprintf(stderr, "Invalid buffer size.\n");
		break;
		case CL_INVALID_MIP_LEVEL:
		fprintf(stderr, "Invalid mip level.\n");
		break;
		case CL_INVALID_GLOBAL_WORK_SIZE:
		fprintf(stderr, "Invalid global work size.\n");
		break;
		case CL_INVALID_PROPERTY:
		fprintf(stderr, "Invalid property.\n");
		break;
		case CL_INVALID_IMAGE_DESCRIPTOR:
		fprintf(stderr, "Invalid image descriptor.\n");
		break;
		case CL_INVALID_COMPILER_OPTIONS:
		fprintf(stderr, "Invalid compiler options.\n");
		break;
		case CL_INVALID_LINKER_OPTIONS:
		fprintf(stderr, "Invalid linker options.\n");
		break;
		case CL_INVALID_DEVICE_PARTITION_COUNT:
		fprintf(stderr, "Invalid device partition count.\n");
		break;
		case CL_INVALID_PIPE_SIZE:
		fprintf(stderr, "Invalid pipe size.\n");
		break;
		case CL_INVALID_DEVICE_QUEUE:
		fprintf(stderr, "Invalid device queue.\n");
		break;

		// NVIDIA specific error.
		case -9999:
		fprintf(stderr, "NVIDIA invalid read or write buffer.\n");
		break;

		default:
		fprintf(stderr, "Unknown error code!\n");
		break;
		}
		}

		#endif /* HAS_LIBOPENCL */
		/******************************************************************************/
		/* CUDA */
		/******************************************************************************/
		#ifdef HAS_LIBCUDART

		struct CUDAContextT {
		CUcontext Cuda;
		};

		struct CUDAKernelT {
CUfunction Cuda;		CUfunction Cuda;
CUmodule CudaModule;		CUmodule CudaModule;
const char *PTXString;		const char *BinaryString;
};		};

struct PollyGPUDevicePtrT {		struct CUDADevicePtrT {
CUdeviceptr Cuda;		CUdeviceptr Cuda;
};		};

/* Dynamic library handles for the CUDA and CUDA runtime library. */		/* Dynamic library handles for the CUDA and CUDA runtime library. */
static void *HandleCuda;		static void *HandleCuda;
static void *HandleCudaRT;		static void *HandleCudaRT;

/* Type-defines of function pointer to CUDA driver APIs. */		/* Type-defines of function pointer to CUDA driver APIs. */
typedef CUresult CUDAAPI CuMemAllocFcnTy(CUdeviceptr *, size_t);		typedef CUresult CUDAAPI CuMemAllocFcnTy(CUdeviceptr *, size_t);
static CuMemAllocFcnTy *CuMemAllocFcnPtr;		static CuMemAllocFcnTy *CuMemAllocFcnPtr;

typedef CUresult CUDAAPI CuLaunchKernelFcnTy(		typedef CUresult CUDAAPI CuLaunchKernelFcnTy(
CUfunction F, unsigned int GridDimX, unsigned int GridDimY,		CUfunction F, unsigned int GridDimX, unsigned int GridDimY,
unsigned int GridDimZ, unsigned int BlockDimX, unsigned int BlockDimY,		unsigned int gridDimZ, unsigned int blockDimX, unsigned int BlockDimY,
unsigned int BlockDimZ, unsigned int SharedMemBytes, CUstream HStream,		unsigned int BlockDimZ, unsigned int SharedMemBytes, CUstream HStream,
void **KernelParams, void **Extra);		void **KernelParams, void **Extra);
static CuLaunchKernelFcnTy *CuLaunchKernelFcnPtr;		static CuLaunchKernelFcnTy *CuLaunchKernelFcnPtr;

typedef CUresult CUDAAPI CuMemcpyDtoHFcnTy(void *, CUdeviceptr, size_t);		typedef CUresult CUDAAPI CuMemcpyDtoHFcnTy(void *, CUdeviceptr, size_t);
static CuMemcpyDtoHFcnTy *CuMemcpyDtoHFcnPtr;		static CuMemcpyDtoHFcnTy *CuMemcpyDtoHFcnPtr;

typedef CUresult CUDAAPI CuMemcpyHtoDFcnTy(CUdeviceptr, const void *, size_t);		typedef CUresult CUDAAPI CuMemcpyHtoDFcnTy(CUdeviceptr, const void *, size_t);
[... 62 lines not shown ...]

typedef CUresult CUDAAPI CuCtxSynchronizeFcnTy();		typedef CUresult CUDAAPI CuCtxSynchronizeFcnTy();
static CuCtxSynchronizeFcnTy *CuCtxSynchronizeFcnPtr;		static CuCtxSynchronizeFcnTy *CuCtxSynchronizeFcnPtr;

/* Type-defines of function pointer to CUDA runtime APIs. */		/* Type-defines of function pointer to CUDA runtime APIs. */
typedef cudaError_t CUDARTAPI CudaThreadSynchronizeFcnTy(void);		typedef cudaError_t CUDARTAPI CudaThreadSynchronizeFcnTy(void);
static CudaThreadSynchronizeFcnTy *CudaThreadSynchronizeFcnPtr;		static CudaThreadSynchronizeFcnTy *CudaThreadSynchronizeFcnPtr;

static void *getAPIHandle(void *Handle, const char *FuncName) {		static void *getAPIHandleCUDA(void *Handle, const char *FuncName) {
char *Err;		char *Err;
void *FuncPtr;		void *FuncPtr;
dlerror();		dlerror();
FuncPtr = dlsym(Handle, FuncName);		FuncPtr = dlsym(Handle, FuncName);
if ((Err = dlerror()) != 0) {		if ((Err = dlerror()) != 0) {
fprintf(stderr, "Load CUDA driver API failed: %s. \n", Err);		fprintf(stderr, "Load CUDA driver API failed: %s. \n", Err);
return 0;		return 0;
}		}
return FuncPtr;		return FuncPtr;
}		}

static int initialDeviceAPILibraries() {		static int initialDeviceAPILibrariesCUDA() {
HandleCuda = dlopen("libcuda.so", RTLD_LAZY);		HandleCuda = dlopen("libcuda.so", RTLD_LAZY);
if (!HandleCuda) {		if (!HandleCuda) {
printf("Cannot open library: %s. \n", dlerror());		fprintf(stderr, "Cannot open library: %s. \n", dlerror());
return 0;		return 0;
}		}

HandleCudaRT = dlopen("libcudart.so", RTLD_LAZY);		HandleCudaRT = dlopen("libcudart.so", RTLD_LAZY);
if (!HandleCudaRT) {		if (!HandleCudaRT) {
printf("Cannot open library: %s. \n", dlerror());		fprintf(stderr, "Cannot open library: %s. \n", dlerror());
return 0;		return 0;
}		}

return 1;		return 1;
}		}

static int initialDeviceAPIs() {		static int initialDeviceAPIsCUDA() {
if (initialDeviceAPILibraries() == 0)		if (initialDeviceAPILibrariesCUDA() == 0)
return 0;		return 0;

/* Get function pointer to CUDA Driver APIs.		/* Get function pointer to CUDA Driver APIs.
*		*
* Note that compilers conforming to the ISO C standard are required to		* Note that compilers conforming to the ISO C standard are required to
* generate a warning if a conversion from a void * pointer to a function		* generate a warning if a conversion from a void * pointer to a function
* pointer is attempted as in the following statements. The warning		* pointer is attempted as in the following statements. The warning
* of this kind of cast may not be emitted by clang and new versions of gcc		* of this kind of cast may not be emitted by clang and new versions of gcc
* as it is valid on POSIX 2008.		* as it is valid on POSIX 2008.
*/		*/
CuLaunchKernelFcnPtr =		CuLaunchKernelFcnPtr =
(CuLaunchKernelFcnTy *)getAPIHandle(HandleCuda, "cuLaunchKernel");		(CuLaunchKernelFcnTy *)getAPIHandleCUDA(HandleCuda, "cuLaunchKernel");

CuMemAllocFcnPtr =		CuMemAllocFcnPtr =
(CuMemAllocFcnTy *)getAPIHandle(HandleCuda, "cuMemAlloc_v2");		(CuMemAllocFcnTy *)getAPIHandleCUDA(HandleCuda, "cuMemAlloc_v2");

CuMemFreeFcnPtr = (CuMemFreeFcnTy *)getAPIHandle(HandleCuda, "cuMemFree_v2");		CuMemFreeFcnPtr =
		(CuMemFreeFcnTy *)getAPIHandleCUDA(HandleCuda, "cuMemFree_v2");

CuMemcpyDtoHFcnPtr =		CuMemcpyDtoHFcnPtr =
(CuMemcpyDtoHFcnTy *)getAPIHandle(HandleCuda, "cuMemcpyDtoH_v2");		(CuMemcpyDtoHFcnTy *)getAPIHandleCUDA(HandleCuda, "cuMemcpyDtoH_v2");

CuMemcpyHtoDFcnPtr =		CuMemcpyHtoDFcnPtr =
(CuMemcpyHtoDFcnTy *)getAPIHandle(HandleCuda, "cuMemcpyHtoD_v2");		(CuMemcpyHtoDFcnTy *)getAPIHandleCUDA(HandleCuda, "cuMemcpyHtoD_v2");

CuModuleUnloadFcnPtr =		CuModuleUnloadFcnPtr =
(CuModuleUnloadFcnTy *)getAPIHandle(HandleCuda, "cuModuleUnload");		(CuModuleUnloadFcnTy *)getAPIHandleCUDA(HandleCuda, "cuModuleUnload");

CuCtxDestroyFcnPtr =		CuCtxDestroyFcnPtr =
(CuCtxDestroyFcnTy *)getAPIHandle(HandleCuda, "cuCtxDestroy");		(CuCtxDestroyFcnTy *)getAPIHandleCUDA(HandleCuda, "cuCtxDestroy");

CuInitFcnPtr = (CuInitFcnTy *)getAPIHandle(HandleCuda, "cuInit");		CuInitFcnPtr = (CuInitFcnTy *)getAPIHandleCUDA(HandleCuda, "cuInit");

CuDeviceGetCountFcnPtr =		CuDeviceGetCountFcnPtr =
(CuDeviceGetCountFcnTy *)getAPIHandle(HandleCuda, "cuDeviceGetCount");		(CuDeviceGetCountFcnTy *)getAPIHandleCUDA(HandleCuda, "cuDeviceGetCount");

CuDeviceGetFcnPtr =		CuDeviceGetFcnPtr =
(CuDeviceGetFcnTy *)getAPIHandle(HandleCuda, "cuDeviceGet");		(CuDeviceGetFcnTy *)getAPIHandleCUDA(HandleCuda, "cuDeviceGet");

CuCtxCreateFcnPtr =		CuCtxCreateFcnPtr =
(CuCtxCreateFcnTy *)getAPIHandle(HandleCuda, "cuCtxCreate_v2");		(CuCtxCreateFcnTy *)getAPIHandleCUDA(HandleCuda, "cuCtxCreate_v2");

CuModuleLoadDataExFcnPtr =		CuModuleLoadDataExFcnPtr = (CuModuleLoadDataExFcnTy *)getAPIHandleCUDA(
(CuModuleLoadDataExFcnTy *)getAPIHandle(HandleCuda, "cuModuleLoadDataEx");		HandleCuda, "cuModuleLoadDataEx");

CuModuleLoadDataFcnPtr =		CuModuleLoadDataFcnPtr =
(CuModuleLoadDataFcnTy *)getAPIHandle(HandleCuda, "cuModuleLoadData");		(CuModuleLoadDataFcnTy *)getAPIHandleCUDA(HandleCuda, "cuModuleLoadData");

CuModuleGetFunctionFcnPtr = (CuModuleGetFunctionFcnTy *)getAPIHandle(		CuModuleGetFunctionFcnPtr = (CuModuleGetFunctionFcnTy *)getAPIHandleCUDA(
HandleCuda, "cuModuleGetFunction");		HandleCuda, "cuModuleGetFunction");

CuDeviceComputeCapabilityFcnPtr =		CuDeviceComputeCapabilityFcnPtr =
(CuDeviceComputeCapabilityFcnTy *)getAPIHandle(		(CuDeviceComputeCapabilityFcnTy *)getAPIHandleCUDA(
HandleCuda, "cuDeviceComputeCapability");		HandleCuda, "cuDeviceComputeCapability");

CuDeviceGetNameFcnPtr =		CuDeviceGetNameFcnPtr =
(CuDeviceGetNameFcnTy *)getAPIHandle(HandleCuda, "cuDeviceGetName");		(CuDeviceGetNameFcnTy *)getAPIHandleCUDA(HandleCuda, "cuDeviceGetName");

CuLinkAddDataFcnPtr =		CuLinkAddDataFcnPtr =
(CuLinkAddDataFcnTy *)getAPIHandle(HandleCuda, "cuLinkAddData");		(CuLinkAddDataFcnTy *)getAPIHandleCUDA(HandleCuda, "cuLinkAddData");

CuLinkCreateFcnPtr =		CuLinkCreateFcnPtr =
(CuLinkCreateFcnTy *)getAPIHandle(HandleCuda, "cuLinkCreate");		(CuLinkCreateFcnTy *)getAPIHandleCUDA(HandleCuda, "cuLinkCreate");

CuLinkCompleteFcnPtr =		CuLinkCompleteFcnPtr =
(CuLinkCompleteFcnTy *)getAPIHandle(HandleCuda, "cuLinkComplete");		(CuLinkCompleteFcnTy *)getAPIHandleCUDA(HandleCuda, "cuLinkComplete");

CuLinkDestroyFcnPtr =		CuLinkDestroyFcnPtr =
(CuLinkDestroyFcnTy *)getAPIHandle(HandleCuda, "cuLinkDestroy");		(CuLinkDestroyFcnTy *)getAPIHandleCUDA(HandleCuda, "cuLinkDestroy");

CuCtxSynchronizeFcnPtr =		CuCtxSynchronizeFcnPtr =
(CuCtxSynchronizeFcnTy *)getAPIHandle(HandleCuda, "cuCtxSynchronize");		(CuCtxSynchronizeFcnTy *)getAPIHandleCUDA(HandleCuda, "cuCtxSynchronize");

/* Get function pointer to CUDA Runtime APIs. */		/* Get function pointer to CUDA Runtime APIs. */
CudaThreadSynchronizeFcnPtr = (CudaThreadSynchronizeFcnTy *)getAPIHandle(		CudaThreadSynchronizeFcnPtr = (CudaThreadSynchronizeFcnTy *)getAPIHandleCUDA(
HandleCudaRT, "cudaThreadSynchronize");		HandleCudaRT, "cudaThreadSynchronize");

return 1;		return 1;
}		}

PollyGPUContext *polly_initContext() {		static PollyGPUContext *initContextCUDA() {
DebugMode = getenv("POLLY_DEBUG") != 0;

dump_function();		dump_function();
PollyGPUContext *Context;		PollyGPUContext *Context;
CUdevice Device;		CUdevice Device;

int Major = 0, Minor = 0, DeviceID = 0;		int Major = 0, Minor = 0, DeviceID = 0;
char DeviceName[256];		char DeviceName[256];
int DeviceCount = 0;		int DeviceCount = 0;

static __thread PollyGPUContext *CurrentContext = NULL;		static __thread PollyGPUContext *CurrentContext = NULL;

if (CurrentContext)		if (CurrentContext)
return CurrentContext;		return CurrentContext;

/* Get API handles. */		/* Get API handles. */
if (initialDeviceAPIs() == 0) {		if (initialDeviceAPIsCUDA() == 0) {
fprintf(stderr, "Getting the \"handle\" for the CUDA driver API failed.\n");		fprintf(stderr, "Getting the \"handle\" for the CUDA driver API failed.\n");
exit(-1);		exit(-1);
}		}

if (CuInitFcnPtr(0) != CUDA_SUCCESS) {		if (CuInitFcnPtr(0) != CUDA_SUCCESS) {
fprintf(stderr, "Initializing the CUDA driver API failed.\n");		fprintf(stderr, "Initializing the CUDA driver API failed.\n");
exit(-1);		exit(-1);
}		}
[... 13 lines not shown ...]
debug_print("> Running on GPU device %d : %s.\n", DeviceID, DeviceName);		debug_print("> Running on GPU device %d : %s.\n", DeviceID, DeviceName);

/* Create context on the device. */		/* Create context on the device. */
Context = (PollyGPUContext *)malloc(sizeof(PollyGPUContext));		Context = (PollyGPUContext *)malloc(sizeof(PollyGPUContext));
if (Context == 0) {		if (Context == 0) {
fprintf(stderr, "Allocate memory for Polly GPU context failed.\n");		fprintf(stderr, "Allocate memory for Polly GPU context failed.\n");
exit(-1);		exit(-1);
}		}
CuCtxCreateFcnPtr(&(Context->Cuda), 0, Device);		Context->Context = malloc(sizeof(CUDAContext));
		if (Context->Context == 0) {
CacheMode = getenv("POLLY_NOCACHE") == 0;		fprintf(stderr, "Allocate memory for Polly CUDA context failed.\n");
		exit(-1);
		}
		CuCtxCreateFcnPtr(&(((CUDAContext *)Context->Context)->Cuda), 0, Device);

if (CacheMode)		if (CacheMode)
CurrentContext = Context;		CurrentContext = Context;

return Context;		return Context;
}		}

static void freeKernel(PollyGPUFunction *Kernel) {		static void freeKernelCUDA(PollyGPUFunction *Kernel) {
if (Kernel->CudaModule)		dump_function();
CuModuleUnloadFcnPtr(Kernel->CudaModule);
		if (CacheMode)
		return;

		if (((CUDAKernel *)Kernel->Kernel)->CudaModule)
		CuModuleUnloadFcnPtr(((CUDAKernel *)Kernel->Kernel)->CudaModule);

		if (Kernel->Kernel)
		free((CUDAKernel *)Kernel->Kernel);

if (Kernel)		if (Kernel)
free(Kernel);		free(Kernel);
}		}

#define KERNEL_CACHE_SIZE 10		static PollyGPUFunction *getKernelCUDA(const char *BinaryBuffer,

PollyGPUFunction *polly_getKernel(const char *PTXBuffer,
const char *KernelName) {		const char *KernelName) {
dump_function();		dump_function();

static __thread PollyGPUFunction *KernelCache[KERNEL_CACHE_SIZE];		static __thread PollyGPUFunction *KernelCache[KERNEL_CACHE_SIZE];
static __thread int NextCacheItem = 0;		static __thread int NextCacheItem = 0;

for (long i = 0; i < KERNEL_CACHE_SIZE; i++) {		for (long i = 0; i < KERNEL_CACHE_SIZE; i++) {
// We exploit here the property that all Polly-ACC kernels are allocated		// We exploit here the property that all Polly-ACC kernels are allocated
// as global constants, hence a pointer comparison is sufficient to		// as global constants, hence a pointer comparison is sufficient to
// determine equality.		// determine equality.
if (KernelCache[i] && KernelCache[i]->PTXString == PTXBuffer) {		if (KernelCache[i] &&
		((CUDAKernel *)KernelCache[i]->Kernel)->BinaryString == BinaryBuffer) {
debug_print(" -> using cached kernel\n");		debug_print(" -> using cached kernel\n");
return KernelCache[i];		return KernelCache[i];
}		}
}		}

PollyGPUFunction *Function = malloc(sizeof(PollyGPUFunction));		PollyGPUFunction *Function = malloc(sizeof(PollyGPUFunction));

if (Function == 0) {		if (Function == 0) {
fprintf(stderr, "Allocate memory for Polly GPU function failed.\n");		fprintf(stderr, "Allocate memory for Polly GPU function failed.\n");
exit(-1);		exit(-1);
}		}
		Function->Kernel = (CUDAKernel *)malloc(sizeof(CUDAKernel));
		if (Function->Kernel == 0) {
		fprintf(stderr, "Allocate memory for Polly CUDA function failed.\n");
		exit(-1);
		}

CUresult Res;		CUresult Res;
CUlinkState LState;		CUlinkState LState;
CUjit_option Options[6];		CUjit_option Options[6];
void *OptionVals[6];		void *OptionVals[6];
float Walltime = 0;		float Walltime = 0;
unsigned long LogSize = 8192;		unsigned long LogSize = 8192;
char ErrorLog[8192], InfoLog[8192];		char ErrorLog[8192], InfoLog[8192];
[18 unchanged lines not shown]		[18 unchanged lines not shown]
OptionVals[4] = (void *)LogSize;		OptionVals[4] = (void *)LogSize;
// Make the linker verbose		// Make the linker verbose
Options[5] = CU_JIT_LOG_VERBOSE;		Options[5] = CU_JIT_LOG_VERBOSE;
OptionVals[5] = (void *)1;		OptionVals[5] = (void *)1;

memset(ErrorLog, 0, sizeof(ErrorLog));		memset(ErrorLog, 0, sizeof(ErrorLog));

CuLinkCreateFcnPtr(6, Options, OptionVals, &LState);		CuLinkCreateFcnPtr(6, Options, OptionVals, &LState);
Res = CuLinkAddDataFcnPtr(LState, CU_JIT_INPUT_PTX, (void *)PTXBuffer,		Res = CuLinkAddDataFcnPtr(LState, CU_JIT_INPUT_PTX, (void *)BinaryBuffer,
strlen(PTXBuffer) + 1, 0, 0, 0, 0);		strlen(BinaryBuffer) + 1, 0, 0, 0, 0);
if (Res != CUDA_SUCCESS) {		if (Res != CUDA_SUCCESS) {
fprintf(stderr, "PTX Linker Error:\n%s\n%s", ErrorLog, InfoLog);		fprintf(stderr, "PTX Linker Error:\n%s\n%s", ErrorLog, InfoLog);
exit(-1);		exit(-1);
}		}

Res = CuLinkCompleteFcnPtr(LState, &CuOut, &OutSize);		Res = CuLinkCompleteFcnPtr(LState, &CuOut, &OutSize);
if (Res != CUDA_SUCCESS) {		if (Res != CUDA_SUCCESS) {
fprintf(stderr, "Complete ptx linker step failed.\n");		fprintf(stderr, "Complete ptx linker step failed.\n");
fprintf(stderr, "\n%s\n", ErrorLog);		fprintf(stderr, "\n%s\n", ErrorLog);
exit(-1);		exit(-1);
}		}

debug_print("CUDA Link Completed in %fms. Linker Output:\n%s\n", Walltime,		debug_print("CUDA Link Completed in %fms. Linker Output:\n%s\n", Walltime,
InfoLog);		InfoLog);

Res = CuModuleLoadDataFcnPtr(&(Function->CudaModule), CuOut);		Res = CuModuleLoadDataFcnPtr(&(((CUDAKernel *)Function->Kernel)->CudaModule),
		CuOut);
if (Res != CUDA_SUCCESS) {		if (Res != CUDA_SUCCESS) {
fprintf(stderr, "Loading ptx assembly text failed.\n");		fprintf(stderr, "Loading ptx assembly text failed.\n");
exit(-1);		exit(-1);
}		}

Res = CuModuleGetFunctionFcnPtr(&(Function->Cuda), Function->CudaModule,		Res = CuModuleGetFunctionFcnPtr(&(((CUDAKernel *)Function->Kernel)->Cuda),
		((CUDAKernel *)Function->Kernel)->CudaModule,
KernelName);		KernelName);
if (Res != CUDA_SUCCESS) {		if (Res != CUDA_SUCCESS) {
fprintf(stderr, "Loading kernel function failed.\n");		fprintf(stderr, "Loading kernel function failed.\n");
exit(-1);		exit(-1);
}		}

CuLinkDestroyFcnPtr(LState);		CuLinkDestroyFcnPtr(LState);

Function->PTXString = PTXBuffer;		((CUDAKernel *)Function->Kernel)->BinaryString = BinaryBuffer;

if (CacheMode) {		if (CacheMode) {
if (KernelCache[NextCacheItem])		if (KernelCache[NextCacheItem])
freeKernel(KernelCache[NextCacheItem]);		freeKernelCUDA(KernelCache[NextCacheItem]);

KernelCache[NextCacheItem] = Function;		KernelCache[NextCacheItem] = Function;

NextCacheItem = (NextCacheItem + 1) % KERNEL_CACHE_SIZE;		NextCacheItem = (NextCacheItem + 1) % KERNEL_CACHE_SIZE;
}		}

return Function;		return Function;
}		}

void polly_freeKernel(PollyGPUFunction *Kernel) {		static void synchronizeDeviceCUDA() {
dump_function();		dump_function();
		if (CuCtxSynchronizeFcnPtr() != CUDA_SUCCESS) {
if (CacheMode)		fprintf(stderr, "Synchronizing device and host memory failed.\n");
return;		exit(-1);
		}
freeKernel(Kernel);
}		}

void polly_copyFromHostToDevice(void *HostData, PollyGPUDevicePtr *DevData,		static void copyFromHostToDeviceCUDA(void *HostData, PollyGPUDevicePtr *DevData,
long MemSize) {		long MemSize) {
dump_function();		dump_function();

CUdeviceptr CuDevData = DevData->Cuda;		CUdeviceptr CuDevData = ((CUDADevicePtr *)DevData->DevicePtr)->Cuda;
CuMemcpyHtoDFcnPtr(CuDevData, HostData, MemSize);		CuMemcpyHtoDFcnPtr(CuDevData, HostData, MemSize);
}		}

void polly_copyFromDeviceToHost(PollyGPUDevicePtr *DevData, void *HostData,		static void copyFromDeviceToHostCUDA(PollyGPUDevicePtr *DevData, void *HostData,
long MemSize) {		long MemSize) {
dump_function();		dump_function();

if (CuMemcpyDtoHFcnPtr(HostData, DevData->Cuda, MemSize) != CUDA_SUCCESS) {		if (CuMemcpyDtoHFcnPtr(HostData, ((CUDADevicePtr *)DevData->DevicePtr)->Cuda,
		MemSize) != CUDA_SUCCESS) {
fprintf(stderr, "Copying results from device to host memory failed.\n");		fprintf(stderr, "Copying results from device to host memory failed.\n");
exit(-1);		exit(-1);
}		}
}		}
void polly_synchronizeDevice() {
dump_function();
if (CuCtxSynchronizeFcnPtr() != CUDA_SUCCESS) {
fprintf(stderr, "Synchronizing device and host memory failed.\n");
exit(-1);
}
}

void polly_launchKernel(PollyGPUFunction *Kernel, unsigned int GridDimX,		static void launchKernelCUDA(PollyGPUFunction *Kernel, unsigned int GridDimX,
unsigned int GridDimY, unsigned int BlockDimX,		unsigned int GridDimY, unsigned int BlockDimX,
unsigned int BlockDimY, unsigned int BlockDimZ,		unsigned int BlockDimY, unsigned int BlockDimZ,
void **Parameters) {		void **Parameters) {
dump_function();		dump_function();

unsigned GridDimZ = 1;		unsigned GridDimZ = 1;
unsigned int SharedMemBytes = CU_SHARED_MEM_CONFIG_DEFAULT_BANK_SIZE;		unsigned int SharedMemBytes = CU_SHARED_MEM_CONFIG_DEFAULT_BANK_SIZE;
CUstream Stream = 0;		CUstream Stream = 0;
void **Extra = 0;		void **Extra = 0;

CUresult Res;		CUresult Res;
Res = CuLaunchKernelFcnPtr(Kernel->Cuda, GridDimX, GridDimY, GridDimZ,		Res =
BlockDimX, BlockDimY, BlockDimZ, SharedMemBytes,		CuLaunchKernelFcnPtr(((CUDAKernel *)Kernel->Kernel)->Cuda, GridDimX,
Stream, Parameters, Extra);		GridDimY, GridDimZ, BlockDimX, BlockDimY, BlockDimZ,
		SharedMemBytes, Stream, Parameters, Extra);
if (Res != CUDA_SUCCESS) {		if (Res != CUDA_SUCCESS) {
fprintf(stderr, "Launching CUDA kernel failed.\n");		fprintf(stderr, "Launching CUDA kernel failed.\n");
exit(-1);		exit(-1);
}		}
}		}

void polly_freeDeviceMemory(PollyGPUDevicePtr *Allocation) {		static void freeDeviceMemoryCUDA(PollyGPUDevicePtr *Allocation) {
dump_function();		dump_function();
CuMemFreeFcnPtr((CUdeviceptr)Allocation->Cuda);		CUDADevicePtr *DevPtr = (CUDADevicePtr *)Allocation->DevicePtr;
		CuMemFreeFcnPtr((CUdeviceptr)DevPtr->Cuda);
		free(DevPtr);
free(Allocation);		free(Allocation);
}		}

PollyGPUDevicePtr *polly_allocateMemoryForDevice(long MemSize) {		static PollyGPUDevicePtr *allocateMemoryForDeviceCUDA(long MemSize) {
dump_function();		dump_function();

PollyGPUDevicePtr *DevData = malloc(sizeof(PollyGPUDevicePtr));		PollyGPUDevicePtr *DevData = malloc(sizeof(PollyGPUDevicePtr));

if (DevData == 0) {		if (DevData == 0) {
fprintf(stderr, "Allocate memory for GPU device memory pointer failed.\n");		fprintf(stderr, "Allocate memory for GPU device memory pointer failed.\n");
exit(-1);		exit(-1);
}		}
		DevData->DevicePtr = (CUDADevicePtr *)malloc(sizeof(CUDADevicePtr));
		if (DevData->DevicePtr == 0) {
		fprintf(stderr, "Allocate memory for GPU device memory pointer failed.\n");
		exit(-1);
		}

CUresult Res = CuMemAllocFcnPtr(&(DevData->Cuda), MemSize);		CUresult Res =
		CuMemAllocFcnPtr(&(((CUDADevicePtr *)DevData->DevicePtr)->Cuda), MemSize);

if (Res != CUDA_SUCCESS) {		if (Res != CUDA_SUCCESS) {
fprintf(stderr, "Allocate memory for GPU device memory pointer failed.\n");		fprintf(stderr, "Allocate memory for GPU device memory pointer failed.\n");
exit(-1);		exit(-1);
}		}

return DevData;		return DevData;
}		}

		static void *getDevicePtrCUDA(PollyGPUDevicePtr *Allocation) {
		dump_function();

		CUDADevicePtr *DevPtr = (CUDADevicePtr *)Allocation->DevicePtr;
		return (void *)DevPtr->Cuda;
		}

		static void freeContextCUDA(PollyGPUContext *Context) {
		dump_function();

		CUDAContext *Ctx = (CUDAContext *)Context->Context;
		if (Ctx->Cuda) {
		CuCtxDestroyFcnPtr(Ctx->Cuda);
		free(Ctx);
		free(Context);
		}

		dlclose(HandleCuda);
		dlclose(HandleCudaRT);
		}

		#endif /* HAS_LIBCUDART */
		/******************************************************************************/
		/* API */
		/******************************************************************************/

		PollyGPUContext *polly_initContext() {
		DebugMode = getenv("POLLY_DEBUG") != 0;
		CacheMode = getenv("POLLY_NOCACHE") == 0;

		dump_function();

		PollyGPUContext *Context;

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		Context = initContextCUDA();
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		Context = initContextCL();
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}

		return Context;
		}

		void polly_freeKernel(PollyGPUFunction *Kernel) {
		dump_function();

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		freeKernelCUDA(Kernel);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		freeKernelCL(Kernel);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}
		}

		PollyGPUFunction *polly_getKernel(const char *BinaryBuffer,
		const char *KernelName) {
		dump_function();

		PollyGPUFunction *Function;

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		Function = getKernelCUDA(BinaryBuffer, KernelName);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		Function = getKernelCL(BinaryBuffer, KernelName);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}

		return Function;
		}

		void polly_copyFromHostToDevice(void *HostData, PollyGPUDevicePtr *DevData,
		long MemSize) {
		dump_function();

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		copyFromHostToDeviceCUDA(HostData, DevData, MemSize);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		copyFromHostToDeviceCL(HostData, DevData, MemSize);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}
		}

		void polly_copyFromDeviceToHost(PollyGPUDevicePtr *DevData, void *HostData,
		long MemSize) {
		dump_function();

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		copyFromDeviceToHostCUDA(DevData, HostData, MemSize);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		copyFromDeviceToHostCL(DevData, HostData, MemSize);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}
		}

		void polly_launchKernel(PollyGPUFunction *Kernel, unsigned int GridDimX,
		unsigned int GridDimY, unsigned int BlockDimX,
		unsigned int BlockDimY, unsigned int BlockDimZ,
		void **Parameters) {
		dump_function();

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		launchKernelCUDA(Kernel, GridDimX, GridDimY, BlockDimX, BlockDimY,
		BlockDimZ, Parameters);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		launchKernelCL(Kernel, GridDimX, GridDimY, BlockDimX, BlockDimY, BlockDimZ,
		Parameters);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}
		}

		void polly_freeDeviceMemory(PollyGPUDevicePtr *Allocation) {
		dump_function();

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		freeDeviceMemoryCUDA(Allocation);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		freeDeviceMemoryCL(Allocation);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}
		}

		PollyGPUDevicePtr *polly_allocateMemoryForDevice(long MemSize) {
		dump_function();

		PollyGPUDevicePtr *DevData;

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		DevData = allocateMemoryForDeviceCUDA(MemSize);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		DevData = allocateMemoryForDeviceCL(MemSize);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}

		return DevData;
		}

void *polly_getDevicePtr(PollyGPUDevicePtr *Allocation) {		void *polly_getDevicePtr(PollyGPUDevicePtr *Allocation) {
dump_function();		dump_function();

return (void *)Allocation->Cuda;		void *DevPtr;

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		DevPtr = getDevicePtrCUDA(Allocation);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		DevPtr = getDevicePtrCL(Allocation);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}

		return DevPtr;
		}

		void polly_synchronizeDevice() {
		dump_function();

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		synchronizeDeviceCUDA();
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		synchronizeDeviceCL();
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}
}		}

void polly_freeContext(PollyGPUContext *Context) {		void polly_freeContext(PollyGPUContext *Context) {
dump_function();		dump_function();

if (CacheMode)		if (CacheMode)
return;		return;

if (Context->Cuda) {		switch (Runtime) {
CuCtxDestroyFcnPtr(Context->Cuda);		#ifdef HAS_LIBCUDART
free(Context);		case RUNTIME_CUDA:
		freeContextCUDA(Context);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		freeContextCL(Context);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}
}		}

dlclose(HandleCuda);		/* Initialize GPUJIT with CUDA as runtime library. */
dlclose(HandleCudaRT);		PollyGPUContext *polly_initContextCUDA() {
		#ifdef HAS_LIBCUDART
		Runtime = RUNTIME_CUDA;
		return polly_initContext();
		#else
		fprintf(stderr, "GPU Runtime was built without CUDA support.\n");
		exit(-1);
		#endif /* HAS_LIBCUDART */
		}

		/* Initialize GPUJIT with OpenCL as runtime library. */
		PollyGPUContext *polly_initContextCL() {
		#ifdef HAS_LIBOPENCL
		Runtime = RUNTIME_CL;
		return polly_initContext();
		#else
		fprintf(stderr, "GPU Runtime was built without OpenCL support.\n");
		exit(-1);
		#endif /* HAS_LIBOPENCL */
}		}
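
The API section above routes every public `polly_*` call through a `switch` on a runtime selected once at initialization, with each backend compiled in only when its macro is defined. A minimal standalone sketch of that pattern follows. All names in it (`Backend`, `HAS_BACKEND_A`, `HAS_BACKEND_B`, `selectBackend`, `allocate`, `errBackend`) are hypothetical stand-ins and not part of GPUJIT, and the backends just return an id instead of touching a GPU.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for HAS_LIBCUDART / HAS_LIBOPENCL. */
#define HAS_BACKEND_A 1
#define HAS_BACKEND_B 1

typedef enum { BACKEND_NONE, BACKEND_A, BACKEND_B } Backend;

/* Set once by an init call, read by every public entry point. */
static Backend Active = BACKEND_NONE;

#ifdef HAS_BACKEND_A
/* Backend-specific implementation; returns 1 to identify backend A. */
static int allocateA(long Size) { return Size > 0 ? 1 : 0; }
#endif

#ifdef HAS_BACKEND_B
/* Backend-specific implementation; returns 2 to identify backend B. */
static int allocateB(long Size) { return Size > 0 ? 2 : 0; }
#endif

/* Called when no backend is selected or the selected one was not built in. */
static void errBackend(void) {
  fprintf(stderr, "No backend selected or backend not built in.\n");
  exit(-1);
}

/* Plays the role of the per-runtime init functions: records the choice. */
void selectBackend(Backend B) {
  assert(B != BACKEND_NONE);
  Active = B;
}

/* Public entry point: forwards to whichever backend was selected.
   Returns the backend id, or 0 if Size is not positive. */
int allocate(long Size) {
  int Result = 0;

  switch (Active) {
#ifdef HAS_BACKEND_A
  case BACKEND_A:
    Result = allocateA(Size);
    break;
#endif
#ifdef HAS_BACKEND_B
  case BACKEND_B:
    Result = allocateB(Size);
    break;
#endif
  default:
    errBackend();
  }

  return Result;
}
```

After `selectBackend(BACKEND_A)`, a call to `allocate(16)` is served by `allocateA`. Removing the `HAS_BACKEND_A` define drops that case, so the same call falls through to `errBackend()`.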

Diff 98109
