This is an archive of the discontinued LLVM Phabricator instance.

Differential D32431

[Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen
Closed, Public

Authored by PhilippSchaad on Apr 24 2017, 6:39 AM.

Details

Reviewers

grosser
bollu
Meinersbur
etherzhhb
singam-sanjay

Commits

rG17f01968f118: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen
rG51904ae35aad: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen
rPLO302379: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen
rPLO302215: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen
rL302379: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen
rL302215: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen

Summary

When compiling for GPU, one can now choose between OpenCL and CUDA with the
corresponding -polly-gpu-runtime flag (libopencl / libcudart). The GPURuntime
library (GPUJIT) has been extended with an OpenCL runtime for that purpose;
it selects the library calls that match the option chosen at compile time
(via different initialization calls).

Additionally, a specific GPU Target architecture can now be chosen with -polly-gpu-arch (only nvptx64 implemented thus far).

Diff Detail

Build Status

Buildable 6145
Build 6145: arc lint + arc unit

Event Timeline

PhilippSchaad created this revision. Apr 24 2017, 6:39 AM

Herald added subscribers: Anastasia, yaxunl, mgorny, nemanjai. Apr 24 2017, 6:39 AM

PhilippSchaad retitled this revision from Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen to [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen. Apr 24 2017, 6:44 AM

PhilippSchaad set the repository for this revision to rL LLVM.

PhilippSchaad added a project: Restricted Project.

PhilippSchaad added subscribers: pollydev, llvm-commits.

Replaced magic numbers, added assertions and fixed if-braces.

PhilippSchaad added reviewers: Meinersbur, etherzhhb. Apr 25 2017, 12:39 AM

I wrote a runtime with similar scope here: https://github.com/Meinersbur/prl . We were once discussing using it for Polly as well. What's the status of that?

lib/CodeGen/PPCGCodeGeneration.cpp
57–59	Did you consider an enum?
1694	Is there some vendor-neutral triple?
1807	Why a static flag?
2683–2689	Did you consider Pass *polly::createPPCGCodeGenerationPass(int Runtime); ?
lib/Support/RegisterPasses.cpp
328–332	A switch instead?

In D32431#736600, @Meinersbur wrote:

I wrote a runtime with similar scope here: https://github.com/Meinersbur/prl . We were once discussing using it for Polly as well. What's the status of that?

I have looked into it a tiny little bit about a month ago, but had then decided to write a basic OpenCL Runtime from scratch in GPUJIT. So to my knowledge, nothing has changed on that status yet.

Currently looking into the rest of your comment-mentioned points.

lib/CodeGen/PPCGCodeGeneration.cpp
1694	Do you mean like `nvptx64-nvcl` / `nvptx64-cuda`?

PhilippSchaad added inline comments. Apr 25 2017, 9:38 AM

lib/CodeGen/PPCGCodeGeneration.cpp
2683–2689	That seems reasonable, but I get a template-conflict for the LLVM Pass-Creation template when trying to change the pass-creation-method structure. I thought it might be easier this way?

PhilippSchaad added inline comments. Apr 25 2017, 9:45 AM

lib/CodeGen/PPCGCodeGeneration.cpp
2683–2689	Correction: looking at wrong function of course, you mean a different one :-)

Stylistic changes and switch to -polly-gpu-runtime=cuda/opencl compiler flag

PhilippSchaad marked 6 inline comments as done. Apr 25 2017, 12:20 PM

Removed left over commented out macros

Meinersbur added inline comments. Apr 25 2017, 1:36 PM

lib/CodeGen/PPCGCodeGeneration.cpp
59–61	See http://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly for LLVM's coding policy for enum members. Nitpick: A "T" suffix is rather unusual.
157	Nitpick: No need to use an `enum` qualifier.
1694	I hoped that there might be some kind of triple that works for OpenCL in general, not only for nvidia (`nvptx`, `nvcl`). If the generated program only works for devices that support cuda anyway, I don't see where the benefit of such a backend is. If there is indeed no backend that also works on non-nvidia devices, should we name the runtime accordingly, e.g. "nvcl"?
lib/Support/RegisterPasses.cpp
328–332	Now that `createPPCGCodeGenerationPass` takes an argument, you don't need a switch anymore.
tools/GPURuntime/GPUJIT.c
308–316	Consistent variable name style? What style do you intend to use in this file?
352–353	Replace the magic number 256 by `sizeof(DeviceRevision)`?

etherzhhb added inline comments. Apr 25 2017, 5:12 PM

include/polly/LinkAllPasses.h
52–53	Is this Runtime supposed to be of type GPURuntimeT? It is a little bit tricky here. Maybe we need to introduce a PPCG header and define the runtime enum there, then include that runtime enum. Or we can declare the function as llvm::Pass *createPPCGCodeGenerationPass(int Runtime = 0); to at least avoid the magic number 0 in line 86.
lib/CodeGen/PPCGCodeGeneration.cpp
1694	for opencl, it can be "spir-unknown-unknown" or "spir64-unknown-unknown", but that may not work :)

Looking into the rest of your comments.

include/polly/LinkAllPasses.h
52–53	Yes, it would be. The reason it's not is exactly the one you mentioned. I was considering adding a PPCG header, but refrained from it because I was hesitant about creating a header 'just for one enum'. If you agree that this is a good solution, I will indeed introduce a new header for PPCG and define the enum there, to get rid of magic numbers. The second option seems reasonable too though.
lib/CodeGen/PPCGCodeGeneration.cpp
1694	Looking into it. The next goal would be to add the AMDGPU backend to generate AMD ISA, which would then again utilize the same OpenCL Runtime implemented here. (I realize there will have to be some naming changes to make that clear in the `GPUJIT`, but as you pointed out, I have a naming mess to fix there anyway.)

etherzhhb added inline comments. Apr 26 2017, 12:12 AM

include/polly/LinkAllPasses.h
52–53	we could start from the second option if you think it is reasonable

Addressed consistency and naming concerns

PhilippSchaad marked 7 inline comments as done. Apr 26 2017, 3:15 AM

PhilippSchaad edited the summary of this revision. Apr 26 2017, 3:19 AM

Made CUDA Runtime default, fixed formatting, adapted test case

Hi Philip and others,

this already looks very cool. I also added some minor comments.

Best,
Tobias

lib/CodeGen/PPCGCodeGeneration.cpp
57–59	You can use C++11 enums ala enum class GPURuntime { CUDA, OpenCL };
1694	Making OpenCL work for CUDA is just the first step. I expect that when adding AMDGPU support, we will use here different triples depending on which vendor to target. AMD will have a specific one, CUDA will have a specific one, and for Intel we likely use the generic SPIR-V comment. I assume this could then also work for Xilinx.

Fixed enum style to C++11

PhilippSchaad marked an inline comment as done. Apr 27 2017, 5:03 AM

Meinersbur added inline comments. Apr 27 2017, 6:55 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1694	At compile time, we don't know which hardware it will run on, so we cannot specify a triple here. Unless you are thinking of a runtime dispatch system, in which case you need to generate all kernels at once. Even then, I would still like to be able to select a single target when I know I will run only on that hardware, to keep the executable small.

PhilippSchaad added inline comments. Apr 27 2017, 7:49 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1694	I thought the goal was to let the user compile for a specific target, i.e. providing something like -polly-gpu-arch=amd/nvidia/intel, and then choosing the correct target triple according to said selection. Meaning for example -polly-gpu-arch=amd would utilize the AMDGPU backend triple and feed that into the OpenCL runtime. Am I misunderstanding something?

Meinersbur added inline comments. Apr 27 2017, 8:48 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1694	I think we were miscommunicating. The -polly-gpu-arch switch is new to me and doesn't appear in this patch. I assumed a fat executable when you mentioned an AMD backend. OpenCL claims to be hardware-independent, with the platform's driver translating OpenCL-C or SPIR(-V) to its proprietary format. In NVidia's terminology, CUDA is a platform of which CUDA C++, their OpenCL implementation, cudart (CUDA runtime) etc. are part. We are still vendor-locked to CUDA, since it only works with CUDA's OpenCL runtime library. -polly-gpu-runtime=opencl is therefore misleading (at least it was to me); it is no alternative to CUDA. It might resolve if it is indeed just the runtime library GPUJIT is linked to. If so, could you make it more clear? I suggest the following switches:
-polly-target=cpu/gpu
if -polly-target=gpu, then -polly-gpu-arch=nvptx64/hsa/spir/spir-v/opencl-c (with -polly-gpu-arch=nvptx64 the only one implemented so far)
if -polly-gpu-arch=nvptx64, then there is a choice between -polly-cuda-runtime=libcudart/libopencl

GPURuntime works on systems with just one of CUDA/OpenCL now.
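
One way to read this update, given the HAS_LIBCUDART and HAS_LIBOPENCL definitions that the patch adds in CMakeLists.txt, is that each backend is compiled in only when its library was found at configure time. A minimal sketch under that assumption; the function `availableRuntimes` is hypothetical and not part of GPUJIT:

```cpp
#include <string>
#include <vector>

// HAS_LIBCUDART and HAS_LIBOPENCL are the preprocessor definitions that the
// patch's CMakeLists.txt adds when the respective library is found.
// availableRuntimes() is a hypothetical helper used only for illustration.
std::vector<std::string> availableRuntimes() {
  std::vector<std::string> Result;
#ifdef HAS_LIBCUDART
  Result.push_back("libcudart"); // CUDA backend was compiled in.
#endif
#ifdef HAS_LIBOPENCL
  Result.push_back("libopencl"); // OpenCL backend was compiled in.
#endif
  return Result;
}
```

Built without either definition, the helper returns an empty list, which corresponds to a system that can still run the code generation tests but cannot launch kernels.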

Harbormaster completed remote builds in B6005: Diff 97189. Apr 29 2017, 6:38 AM

PhilippSchaad added inline comments. Apr 29 2017, 6:40 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1694	This change should address exactly this. The framework is now set up to introduce new architectures and to utilize e.g. the AMDGPU backend instead of NVPTX.

singam-sanjay added a subscriber: singam-sanjay. Apr 29 2017, 7:37 AM

singam-sanjay added inline comments.

lib/CodeGen/PPCGCodeGeneration.cpp
1728	Does `nvptx64-nvidia-nvcl` mean OpenCL code meant to be run on NVIDIA GPUs ?

PhilippSchaad added inline comments. Apr 29 2017, 7:40 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1728	Yes, exactly. It generates a slightly different flavor of PTX, which can be used by OpenCL to generate a kernel from the PTX binary (on NVIDIA GPUs). If you were to use the standard CUDA PTX, OpenCL would complain because of wrong argument accesses.

PhilippSchaad edited the summary of this revision. Apr 29 2017, 8:14 AM

PhilippSchaad set the repository for this revision to rL LLVM.

singam-sanjay added inline comments. Apr 29 2017, 9:33 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1728	Okay. From what you're saying, `nvptx64-nvidia-nvcl` indicates that the backend must generate NVPTX code for a 64-bit architecture for an NVIDIA GPU controlled by an OpenCL driver. Please correct me if I'm wrong.

PhilippSchaad added inline comments. Apr 29 2017, 9:37 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1728	That is correct.

singam-sanjay added inline comments. Apr 29 2017, 11:25 PM

lib/CodeGen/PPCGCodeGeneration.cpp
1728	Thank you ! That was helpful.
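
The exchange above pins down how the target triple depends on the chosen architecture and runtime. A minimal sketch of that mapping, reusing the two enums from the patch's new header. The helper name `selectTriple` is hypothetical, `nvptx64-nvidia-cuda` is assumed here as the CUDA counterpart of the `nvptx64-nvidia-nvcl` triple discussed in the thread, and a plain `assert` stands in for `llvm_unreachable` so the example builds without LLVM:

```cpp
#include <cassert>
#include <string>

// Mirrors the enums introduced in polly/CodeGen/PPCGCodeGeneration.h.
enum GPUArch { NVPTX64 };
enum GPURuntime { CUDA, OpenCL };

// Hypothetical helper: pick the target triple for the generated kernel module.
std::string selectTriple(GPUArch Arch, GPURuntime Runtime) {
  switch (Arch) {
  case NVPTX64:
    switch (Runtime) {
    case CUDA:
      return "nvptx64-nvidia-cuda"; // Assumed triple for the CUDA runtime.
    case OpenCL:
      return "nvptx64-nvidia-nvcl"; // PTX flavor consumed by NVIDIA's OpenCL.
    }
    break;
  }
  // Only reached if an out-of-range value was cast to one of the enums.
  assert(false && "unsupported GPU architecture/runtime pair");
  return "";
}
```

Adding an architecture such as AMDGPU would then mean adding an enumerator and a case, and the compiler can flag any switch that was not updated.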

PhilippSchaad added a reviewer: singam-sanjay. Apr 30 2017, 3:42 AM

Integrated D32226 - Managed memory support

@grosser @Meinersbur ping

Fixed formatting and managed-memory test case (including pre-existing bug)

I consider only the clSetKernelArg handling a remaining bigger issue. Having only the "polly_"-prefixed functions non-static would also be great.

Tobias is the sole author of Polly-ACC. I think he should give the final LGTM.

include/polly/LinkAllPasses.h
52–53	I assume you kept the arguments of type `int` to not include header files here.
lib/CodeGen/PPCGCodeGeneration.cpp
764–767	Can you make this a switch so you get warned by the compiler when adding more runtimes?
1725–1728	Could also be a switch.
2692	We should check whether the arguments are valid. Such as: switch (Runtime) { case 1: case 2: ... default: llvm_unreachable("Invalid argument for Runtime"); }
2694–2698	Similarly: switch (Arch) { case 1: ... default: llvm_unreachable("Invalid argument for Arch"); }
lib/Support/RegisterPasses.cpp
330–332	int Arch; switch (GPUArch) { case GPU_ARCH_NVPTX64: Arch = 1; break; } int Runtime; switch (GPURuntime) { case GPU_RUNTIME_CUDA: Runtime = 1; break; case GPU_RUNTIME_OPENCL: Runtime = 2; break; } PM.add(polly::createPPCGCodeGenerationPass(Arch, Runtime));
With "you don't need a switch anymore", I was thinking about the like of: PM.add(polly::createPPCGCodeGenerationPass(1, GPURuntime + 1)); Your choice.
It could be helpful to have `createPPCGCodeGenerationPass` accept the GPURuntime enum as arguments instead.
In your solution, I don't see the use of `static const int` local variables. If you want identifiers that give names to the accepted arguments, declare them as `#define` or `static const int` in the header file that also declares `createPPCGCodeGenerationPass`, so these can be used in the implementation of `createPPCGCodeGenerationPass` as well.
tools/GPURuntime/GPUJIT.c
397–401	Did you consider introducing a new function for this sequence of code? It appears quite often.
610–618	Trying each argument size after the other and hoping one matches is not good. The caller must know the argument sizes. You probably have to pass the sizes in another argument to `launchKernelCL` that contains those sizes for each argument, generated by Polly. Without this, the code will fail if you pass a struct (or vector) of size other than 8, 4, 2, or 1.
676	Shouldn't these print to `stderr`?
750	The function name does not follow the naming of other functions in this file. In C it is common to have the public API functions prefixed with the library name (here: "polly") and everything else static. Don't choose the prefix of another library (here: "cl_"). This avoids symbol conflicts that arise when multiple libraries happen to give the same name to a function.
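
The two switch suggestions above (no default branch so the compiler can warn about unhandled enumerators, plus a hard failure for invalid values) could look roughly like this. A sketch only: `runtimeLibraryName` is a hypothetical helper, the scoped enum follows the `enum class` suggestion made earlier in the review, and `std::abort` stands in for `llvm_unreachable` so the example builds without LLVM:

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

enum class GPURuntime { CUDA, OpenCL };

// Hypothetical helper mapping a runtime to the library named in the summary.
std::string runtimeLibraryName(GPURuntime Runtime) {
  // No default case: with -Wswitch the compiler reports this switch as soon
  // as a new enumerator is added to GPURuntime without a matching case.
  switch (Runtime) {
  case GPURuntime::CUDA:
    return "libcudart";
  case GPURuntime::OpenCL:
    return "libopencl";
  }
  // Only reached for values outside the enum, e.g. after a bad cast from int.
  std::fprintf(stderr, "Invalid argument for Runtime\n");
  std::abort();
}
```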

PhilippSchaad marked 17 inline comments as done. May 2 2017, 1:13 PM

PhilippSchaad added inline comments.

include/polly/LinkAllPasses.h
52–53	That is correct, yes.
lib/Support/RegisterPasses.cpp
330–332	"With "you don't need a switch anymore", I was thinking about the like of: PM.add(polly::createPPCGCodeGenerationPass(1, GPURuntime + 1)); Your choice."
Personally, I think the current solution is a tiny little bit more 'documenting'. Both are fine though, good call.
"It could be helpful to have createPPCGCodeGenerationPass accept the GPURuntime enum as arguments instead."
That is true, but it would mean having to provide a header with that GPURuntime enum instead, right?
"In your solution, I don't see the use of static const int local variables. If you want identifiers that give names to the accepted arguments, declare them as #define or static const int in the header file that also declares createPPCGCodeGenerationPass, so these can be used in the implementation of createPPCGCodeGenerationPass as well."
Would it maybe make sense to introduce a PPCGCodeGeneration header at this point?
tools/GPURuntime/GPUJIT.c
610–618	Yes, this is a priority issue still. The issue will have to be resolved at some point. This is basically a temporary way around some (probably) major argument handling changes in PPCG etc.
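
The fix requested for lines 610–618 is for the caller to supply each argument's size instead of probing sizes until one is accepted. A rough sketch of that interface, with assumptions stated: `setKernelArgStub` is a hypothetical stand-in that only records its inputs (the real call would be `clSetKernelArg(kernel, arg_index, arg_size, arg_value)`), and `setAllKernelArgs` and `ParameterSizes` are illustrative names, not part of GPUJIT:

```cpp
#include <cstddef>
#include <vector>

// Records what would have been passed to the OpenCL runtime.
struct RecordedArg {
  unsigned Index;
  std::size_t Size;
  const void *Value;
};
static std::vector<RecordedArg> Recorded;

// Hypothetical stand-in for clSetKernelArg so the sketch runs without OpenCL.
static int setKernelArgStub(unsigned Index, std::size_t Size,
                            const void *Value) {
  Recorded.push_back({Index, Size, Value});
  return 0; // 0 corresponds to CL_SUCCESS.
}

// The caller (code generated by Polly, in the reviewer's proposal) passes
// one size per parameter, so no size has to be guessed.
int setAllKernelArgs(void **Parameters, const std::size_t *ParameterSizes,
                     unsigned NumArgs) {
  for (unsigned I = 0; I < NumArgs; ++I) {
    int Err = setKernelArgStub(I, ParameterSizes[I], Parameters[I]);
    if (Err != 0)
      return Err; // Stop at the first argument the runtime rejects.
  }
  return 0;
}
```

With this shape, a struct or vector argument of any size is handled by passing its `sizeof`, a case that a trial loop over 8, 4, 2 and 1 cannot cover.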

Addressed multiple issues pointed out in comment

Fixed formatting

You changed stdout to stderr everywhere, which is better in my point of view, but logically is a different change. Sorry that I didn't realize that the libcudart also printed to stdout before so you tried to be consistent. Could you commit that change separately beforehand? (Maybe also the change in argument name capitalization)

I am accepting the patch provided that you are going to improve the clSetKernelArg situation later and add a TODO to the code about it. The other stuff is of a stylistic nature only.

Please also wait for Tobias' approval.

lib/Support/RegisterPasses.cpp
330–332	Please remove at least the `static` keyword. It makes sense for global constants, but not for function-local ones. The style static const int ArgumentName = 0; func(ArgumentName); is rather unusual in LLVM-style code (but not bad if applied consistently, which is unfortunately not the case in Polly). I've seen func(/* ArgumentName = */ 0); much more often. In this case I think `UseOpenCLRuntime`, `UseCUDARuntime` and `TargetNVPTX64` should really be global constants declared close to the declaration of `createPPCGCodeGenerationPass` so they can be used by every caller of that function.
"Would it maybe make sense to introduce a PPCGCodeGeneration header at this point?"
Yes, that sounds good to me as well.
tools/GPURuntime/GPUJIT.c
606–607	Thanks for the introduction of `checkOpenCLError`. You could also introduce a helper for these two lines, for instance: if (!GlobalContext) handleError("GPGPU-code generation not correctly initialized.\n"); `handleError` could also be called by `checkOpenCLError`. It helps centralise the error handling, such that if we change some detail about it (e.g. the return code on exit, or some cleanup code), there is a single function for that.
610–618	Also note that I am not sure OpenCL ICDs are required to report `CL_INVALID_ARG_SIZE` correctly. An ICD might just trust the caller, or be badly written.
961–967	These are unrelated changes Tobias usually complains about. I personally don't care.
1101	Unrelated whitespace change?
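
The centralised error handling suggested for lines 606–607 could be structured as follows. This is a sketch of the idea only: `handleError` and `checkOpenCLError` are the names used in the review, but their signatures here are assumptions, `requireContext` is a hypothetical wrapper for the null check, and the exit code is illustrative:

```cpp
#include <cstdio>
#include <cstdlib>

// Single place that decides how a fatal runtime error is reported and which
// exit code is used; cleanup code could be added here later.
[[noreturn]] static void handleError(const char *Message) {
  std::fprintf(stderr, "%s", Message);
  std::exit(EXIT_FAILURE);
}

// Assumed shape of the OpenCL check: 0 is CL_SUCCESS, anything else is fatal.
static void checkOpenCLError(int Ret, const char *Message) {
  if (Ret != 0)
    handleError(Message);
}

// Hypothetical wrapper for the "not correctly initialized" check.
static void requireContext(const void *GlobalContext) {
  if (!GlobalContext)
    handleError("GPGPU-code generation not correctly initialized.\n");
}
```

Both checks funnel into `handleError`, so changing the exit code or adding cleanup touches one function rather than every call site.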

This revision is now accepted and ready to land. May 4 2017, 2:41 AM

Addressed most of your concerns. @grosser it should be ready now, what do you think?

Introduced PPCGCodeGeneration header file for simplicity

Harbormaster completed remote builds in B6145: Diff 97804. May 4 2017, 3:45 AM

@Meinersbur the unrelated changes you mentioned have been added/moved to D32852 and D32854.

grosser accepted this revision. May 4 2017, 3:52 AM

grosser added inline comments.

test/GPGPU/cuda-managed-memory-simple.ll
49	This change is unrelated.

PhilippSchaad added inline comments. May 4 2017, 3:55 AM

test/GPGPU/cuda-managed-memory-simple.ll
49	It is, but it got fixed in the meantime anyway. Removing it.

Closed by commit rL302215: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen (authored by bollu). May 5 2017, 1:08 AM

This revision was automatically updated to reflect the committed changes.

bollu mentioned this in rL302217: Revert "[Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen". May 5 2017, 2:15 AM

Reopened for rebase

This revision is now accepted and ready to land. May 7 2017, 2:36 AM

Rebase

Closed by commit rL302379: [Polly] Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen (authored by bollu). May 7 2017, 2:17 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
CMakeLists.txt	10 lines
include/polly/CodeGen/PPCGCodeGeneration.h	24 lines
include/polly/LinkAllPasses.h	4 lines
lib/CodeGen/PPCGCodeGeneration.cpp	113 lines
lib/Support/RegisterPasses.cpp	21 lines
test/GPGPU/cuda-managed-memory-simple.ll	4 lines
test/GPGPU/size-cast.ll	2 lines
tools/CMakeLists.txt	4 lines
tools/GPURuntime/GPUJIT.h	19 lines
tools/GPURuntime/GPUJIT.c	1383 lines

Commit	Tree	Parents	Author	Summary	Date
dc4cf71e6d39	7b1d71da76c0	d27a2ad8424d	Philipp Schaad	Introduced PPCGCodeGeneration header file for simplicity	May 4 2017, 3:44 AM
d27a2ad8424d	c5607b2ae56f	99c94d949621 6001e9cca31f	Philipp Schaad	Merge branch 'master' of http://llvm.org/git/polly into GPGPU_CL_Runtime	May 3 2017, 1:26 PM
99c94d949621	166d5f2fffcf	5eeebfdadfb8 5d521a3ae86b	Philipp Schaad	Merged	May 3 2017, 1:25 PM
5eeebfdadfb8	8c3341f0ba8a	c14aa3c7c652	Philipp Schaad	Fixed formatting	May 2 2017, 1:36 PM
c14aa3c7c652	23bbbd6342a8	7a10e7e36eef	Philipp Schaad	Addressed multiple issues pointed out in comment	May 2 2017, 1:13 PM
7a10e7e36eef	5829a45f8508	c0854d424602	Philipp Schaad	Fixed formatting and managed-memory test case (including pre-existing bug)	May 2 2017, 5:53 AM
c0854d424602	081be745d840	5765f28cc97f 3e9514a818d9	Philipp Schaad	Merge branch 'master' of http://llvm.org/git/polly into GPGPU_CL_Runtime	May 2 2017, 1:17 AM
5765f28cc97f	d9c58acf2024	6f645f66a5d1	Philipp Schaad	Integrated D32226 - Managed memory support	Apr 30 2017, 3:47 AM
6f645f66a5d1	01d3fe751f8f	09c0a6a276e6 8d064625e370	Philipp Schaad	Merge branch 'master' of http://llvm.org/git/polly into GPGPU_CL_Runtime	Apr 30 2017, 3:28 AM
09c0a6a276e6	480d6f301701	bb8dca160a3d	Philipp Schaad	GPURuntime works on systems with just one of CUDA/OpenCL now.	Apr 29 2017, 6:37 AM
bb8dca160a3d	f31aed45e684	652925f4c47f 50ec033305f1	Philipp Schaad	Merge branch 'master' of http://llvm.org/git/polly into GPGPU_CL_Runtime	Apr 27 2017, 9:41 AM
652925f4c47f	87573788f1df	b4fa3409f558	Philipp Schaad	Fixed enum style to C++11	Apr 27 2017, 5:02 AM
b4fa3409f558	8c7699a47b9d	436d33ed5f77	Philipp Schaad	Made CUDA Runtime default, fixed formatting, adapted test case	Apr 26 2017, 4:43 AM
436d33ed5f77	ee72c7640460	054b8d7f3ec1	Philipp Schaad	Addressed consistency and naming concerns	Apr 26 2017, 3:06 AM
054b8d7f3ec1	f90bd19ef0dd	1caca697a6d6	Philipp Schaad	Removed left over commented out macros	Apr 25 2017, 12:23 PM
1caca697a6d6	81ffcff6e66e	433bbb1d88ed	Philipp Schaad	Stylistic changes and switch to -polly-gpu-runtime=cuda/opencl compiler flag	Apr 25 2017, 12:16 PM
433bbb1d88ed	5df45e71b206	3f88fe9e76b8 30623ffa756d	Philipp Schaad	Merge branch 'master' of http://llvm.org/git/polly into GPGPU_CL_Runtime	Apr 25 2017, 8:15 AM
3f88fe9e76b8	246c32a0a20e	103508186187	Philipp Schaad	Replaced magic numbers, added assertions and fixed if-braces.	Apr 24 2017, 7:15 AM
103508186187	dbc12905e772	ae6794852ba6	Philipp Schaad	Fixed formatting mistakes.	Apr 24 2017, 6:11 AM
ae6794852ba6	41f18542c123	d5bd2322fad5	Philipp Schaad	Fixed GPURuntime requiring -lOpenCL	Apr 24 2017, 5:15 AM
d5bd2322fad5	0dd1c5a5221d	d43b863664bb	Philipp Schaad	Added OpenCL Runtime to GPURuntime Library for GPGPU CodeGen	Apr 24 2017, 3:05 AM

Diff 97804

CMakeLists.txt

	[146 lines not shown]

	# Add path for custom modules			# Add path for custom modules
	set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${POLLY_SOURCE_DIR}/cmake")			set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${POLLY_SOURCE_DIR}/cmake")

	SET(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)			SET(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)

	option(POLLY_ENABLE_GPGPU_CODEGEN "Enable GPGPU code generation feature" OFF)			option(POLLY_ENABLE_GPGPU_CODEGEN "Enable GPGPU code generation feature" OFF)
	if (POLLY_ENABLE_GPGPU_CODEGEN)			if (POLLY_ENABLE_GPGPU_CODEGEN)
	# Do not require CUDA, as GPU code generation test cases can be run without			# Do not require CUDA/OpenCL, as GPU code generation test cases can be run
	# a cuda library.			# without a CUDA/OpenCL library.
	FIND_PACKAGE(CUDA)			FIND_PACKAGE(CUDA)
				FIND_PACKAGE(OpenCL)
	set(GPU_CODEGEN TRUE)			set(GPU_CODEGEN TRUE)
	else(POLLY_ENABLE_GPGPU_CODEGEN)			else(POLLY_ENABLE_GPGPU_CODEGEN)
	set(GPU_CODEGEN FALSE)			set(GPU_CODEGEN FALSE)
	endif(POLLY_ENABLE_GPGPU_CODEGEN)			endif(POLLY_ENABLE_GPGPU_CODEGEN)


	# Support GPGPU code generation if the library is available.			# Support GPGPU code generation if the library is available.
	if (CUDALIB_FOUND)			if (CUDALIB_FOUND)
				add_definitions(-DHAS_LIBCUDART)
	INCLUDE_DIRECTORIES( ${CUDALIB_INCLUDE_DIR} )			INCLUDE_DIRECTORIES( ${CUDALIB_INCLUDE_DIR} )
	endif(CUDALIB_FOUND)			endif(CUDALIB_FOUND)
				if (OpenCL_FOUND)
				add_definitions(-DHAS_LIBOPENCL)
				INCLUDE_DIRECTORIES( ${OpenCL_INCLUDE_DIR} )
				endif(OpenCL_FOUND)

	option(POLLY_BUNDLED_ISL "Use the bundled version of libisl included in Polly" ON)			option(POLLY_BUNDLED_ISL "Use the bundled version of libisl included in Polly" ON)
	if (NOT POLLY_BUNDLED_ISL)			if (NOT POLLY_BUNDLED_ISL)
	find_package(ISL MODULE REQUIRED)			find_package(ISL MODULE REQUIRED)
	message(STATUS "Using external libisl ${ISL_VERSION} in: ${ISL_PREFIX}")			message(STATUS "Using external libisl ${ISL_VERSION} in: ${ISL_PREFIX}")
	set(ISL_TARGET ISL)			set(ISL_TARGET ISL)
	else()			else()
	set(ISL_INCLUDE_DIRS			set(ISL_INCLUDE_DIRS
	[81 lines not shown]

include/polly/CodeGen/PPCGCodeGeneration.h

This file was added.

				//===--- polly/PPCGCodeGeneration.h - Polly Accelerator Code Generation. --===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// Take a scop created by ScopInfo and map it to GPU code using the ppcg
				// GPU mapping strategy.
				//
				//===----------------------------------------------------------------------===//

				#ifndef POLLY_PPCGCODEGENERATION_H
				#define POLLY_PPCGCODEGENERATION_H

				/// The GPU Architecture to target.
				enum GPUArch { NVPTX64 };

				/// The GPU Runtime implementation to use.
				enum GPURuntime { CUDA, OpenCL };

				#endif // POLLY_PPCGCODEGENERATION_H

include/polly/LinkAllPasses.h

	[9 lines not shown]
	// This header file pulls in all transformation and analysis passes for tools			// This header file pulls in all transformation and analysis passes for tools
	// like opt and bugpoint that need this functionality.			// like opt and bugpoint that need this functionality.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef POLLY_LINKALLPASSES_H			#ifndef POLLY_LINKALLPASSES_H
	#define POLLY_LINKALLPASSES_H			#define POLLY_LINKALLPASSES_H

				#include "polly/CodeGen/PPCGCodeGeneration.h"
	#include "polly/Config/config.h"			#include "polly/Config/config.h"
	#include "polly/PruneUnprofitable.h"			#include "polly/PruneUnprofitable.h"
	#include "polly/Simplify.h"			#include "polly/Simplify.h"
	#include "polly/Support/DumpModulePass.h"			#include "polly/Support/DumpModulePass.h"
	#include "llvm/ADT/StringRef.h"			#include "llvm/ADT/StringRef.h"
	#include <cstdlib>			#include <cstdlib>

	namespace llvm {			namespace llvm {
	[17 lines not shown]
	llvm::Pass *createPollyCanonicalizePass();			llvm::Pass *createPollyCanonicalizePass();
	llvm::Pass *createPolyhedralInfoPass();			llvm::Pass *createPolyhedralInfoPass();
	llvm::Pass *createScopDetectionPass();			llvm::Pass *createScopDetectionPass();
	llvm::Pass *createScopInfoRegionPassPass();			llvm::Pass *createScopInfoRegionPassPass();
	llvm::Pass *createScopInfoWrapperPassPass();			llvm::Pass *createScopInfoWrapperPassPass();
	llvm::Pass *createIslAstInfoPass();			llvm::Pass *createIslAstInfoPass();
	llvm::Pass *createCodeGenerationPass();			llvm::Pass *createCodeGenerationPass();
	#ifdef GPU_CODEGEN			#ifdef GPU_CODEGEN
	llvm::Pass *createPPCGCodeGenerationPass();			llvm::Pass *createPPCGCodeGenerationPass(GPUArch Arch = GPUArch::NVPTX64,
				GPURuntime Runtime = GPURuntime::CUDA);
	#endif			#endif
	llvm::Pass *createIslScheduleOptimizerPass();			llvm::Pass *createIslScheduleOptimizerPass();
	llvm::Pass *createFlattenSchedulePass();			llvm::Pass *createFlattenSchedulePass();
	llvm::Pass *createDeLICMPass();			llvm::Pass *createDeLICMPass();

	extern char &CodePreparationID;			extern char &CodePreparationID;
	} // namespace polly			} // namespace polly

	[56 lines not shown]

lib/CodeGen/PPCGCodeGeneration.cpp

//===------ PPCGCodeGeneration.cpp - Polly Accelerator Code Generation. ---===//		//===------ PPCGCodeGeneration.cpp - Polly Accelerator Code Generation. ---===//
//		//
// The LLVM Compiler Infrastructure		// The LLVM Compiler Infrastructure
//		//
// This file is distributed under the University of Illinois Open Source		// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.		// License. See LICENSE.TXT for details.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// Take a scop created by ScopInfo and map it to GPU code using the ppcg		// Take a scop created by ScopInfo and map it to GPU code using the ppcg
// GPU mapping strategy.		// GPU mapping strategy.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

		#include "polly/CodeGen/PPCGCodeGeneration.h"
#include "polly/CodeGen/IslAst.h"		#include "polly/CodeGen/IslAst.h"
#include "polly/CodeGen/IslNodeBuilder.h"		#include "polly/CodeGen/IslNodeBuilder.h"
#include "polly/CodeGen/Utils.h"		#include "polly/CodeGen/Utils.h"
#include "polly/DependenceInfo.h"		#include "polly/DependenceInfo.h"
#include "polly/LinkAllPasses.h"		#include "polly/LinkAllPasses.h"
#include "polly/Options.h"		#include "polly/Options.h"
#include "polly/ScopDetection.h"		#include "polly/ScopDetection.h"
#include "polly/ScopInfo.h"		#include "polly/ScopInfo.h"
[25 lines not shown]

#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"

using namespace polly;		using namespace polly;
using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "polly-codegen-ppcg"		#define DEBUG_TYPE "polly-codegen-ppcg"

static cl::opt<bool> DumpSchedule("polly-acc-dump-schedule",		static cl::opt<bool> DumpSchedule("polly-acc-dump-schedule",
cl::desc("Dump the computed GPU Schedule"),		cl::desc("Dump the computed GPU Schedule"),
cl::Hidden, cl::init(false), cl::ZeroOrMore,		cl::Hidden, cl::init(false), cl::ZeroOrMore,
cl::cat(PollyCategory));		cl::cat(PollyCategory));

static cl::opt<bool>		static cl::opt<bool>
DumpCode("polly-acc-dump-code",		DumpCode("polly-acc-dump-code",
cl::desc("Dump C code describing the GPU mapping"), cl::Hidden,		cl::desc("Dump C code describing the GPU mapping"), cl::Hidden,
cl::init(false), cl::ZeroOrMore, cl::cat(PollyCategory));		cl::init(false), cl::ZeroOrMore, cl::cat(PollyCategory));

static cl::opt<bool> DumpKernelIR("polly-acc-dump-kernel-ir",		static cl::opt<bool> DumpKernelIR("polly-acc-dump-kernel-ir",
cl::desc("Dump the kernel LLVM-IR"),		cl::desc("Dump the kernel LLVM-IR"),
cl::Hidden, cl::init(false), cl::ZeroOrMore,		cl::Hidden, cl::init(false), cl::ZeroOrMore,
[... 79 lines not shown ...]
/// for generating GPU specific user nodes.		/// for generating GPU specific user nodes.
///		///
/// @see GPUNodeBuilder::createUser		/// @see GPUNodeBuilder::createUser
class GPUNodeBuilder : public IslNodeBuilder {		class GPUNodeBuilder : public IslNodeBuilder {
public:		public:
GPUNodeBuilder(PollyIRBuilder &Builder, ScopAnnotator &Annotator,		GPUNodeBuilder(PollyIRBuilder &Builder, ScopAnnotator &Annotator,
const DataLayout &DL, LoopInfo &LI, ScalarEvolution &SE,		const DataLayout &DL, LoopInfo &LI, ScalarEvolution &SE,
DominatorTree &DT, Scop &S, BasicBlock *StartBlock,		DominatorTree &DT, Scop &S, BasicBlock *StartBlock,
gpu_prog *Prog)		gpu_prog *Prog, GPURuntime Runtime, GPUArch Arch)
		Meinersbur: Nitpick: No need to use an `enum` qualifier.
: IslNodeBuilder(Builder, Annotator, DL, LI, SE, DT, S, StartBlock),		: IslNodeBuilder(Builder, Annotator, DL, LI, SE, DT, S, StartBlock),
Prog(Prog) {		Prog(Prog), Runtime(Runtime), Arch(Arch) {
getExprBuilder().setIDToSAI(&IDToSAI);		getExprBuilder().setIDToSAI(&IDToSAI);
}		}

/// Create after-run-time-check initialization code.		/// Create after-run-time-check initialization code.
void initializeAfterRTH();		void initializeAfterRTH();

/// Finalize the generated scop.		/// Finalize the generated scop.
virtual void finalize();		virtual void finalize();
[... 29 lines not shown ...]	private:
/// A module containing GPU code.		/// A module containing GPU code.
///		///
/// This pointer is only set in case we are currently generating GPU code.		/// This pointer is only set in case we are currently generating GPU code.
std::unique_ptr<Module> GPUModule;		std::unique_ptr<Module> GPUModule;

/// The GPU program we generate code for.		/// The GPU program we generate code for.
gpu_prog *Prog;		gpu_prog *Prog;

		/// The GPU Runtime implementation to use (OpenCL or CUDA).
		GPURuntime Runtime;

		/// The GPU Architecture to target.
		GPUArch Arch;

/// Class to free isl_ids.		/// Class to free isl_ids.
class IslIdDeleter {		class IslIdDeleter {
public:		public:
void operator()(__isl_take isl_id *Id) { isl_id_free(Id); };		void operator()(__isl_take isl_id *Id) { isl_id_free(Id); };
};		};

/// A set containing all isl_ids allocated in a GPU kernel.		/// A set containing all isl_ids allocated in a GPU kernel.
///		///
[... 535 lines not shown ...]	if (!F) {
FunctionType *Ty = FunctionType::get(Builder.getVoidTy(), false);		FunctionType *Ty = FunctionType::get(Builder.getVoidTy(), false);
F = Function::Create(Ty, Linkage, Name, M);		F = Function::Create(Ty, Linkage, Name, M);
}		}

Builder.CreateCall(F);		Builder.CreateCall(F);
}		}

Value *GPUNodeBuilder::createCallInitContext() {		Value *GPUNodeBuilder::createCallInitContext() {
const char *Name = "polly_initContext";		const char *Name;

		switch (Runtime) {
		case GPURuntime::CUDA:
		Name = "polly_initContextCUDA";
		break;
		Meinersbur: Can you make this a switch so you get warned by the compiler when adding more runtimes?
		case GPURuntime::OpenCL:
		Name = "polly_initContextCL";
		break;
		}

Module *M = Builder.GetInsertBlock()->getParent()->getParent();		Module *M = Builder.GetInsertBlock()->getParent()->getParent();
Function *F = M->getFunction(Name);		Function *F = M->getFunction(Name);

// If F is not available, declare it.		// If F is not available, declare it.
if (!F) {		if (!F) {
GlobalValue::LinkageTypes Linkage = Function::ExternalLinkage;		GlobalValue::LinkageTypes Linkage = Function::ExternalLinkage;
std::vector<Type *> Args;		std::vector<Type *> Args;
FunctionType *Ty = FunctionType::get(Builder.getInt8PtrTy(), Args, false);		FunctionType *Ty = FunctionType::get(Builder.getInt8PtrTy(), Args, false);
[... 259 lines not shown ...]	void GPUNodeBuilder::createScopStmt(isl_ast_expr *Expr,
if (Stmt->isBlockStmt())		if (Stmt->isBlockStmt())
BlockGen.copyStmt(*Stmt, LTS, Indexes);		BlockGen.copyStmt(*Stmt, LTS, Indexes);
else		else
RegionGen.copyStmt(*Stmt, LTS, Indexes);		RegionGen.copyStmt(*Stmt, LTS, Indexes);
}		}

void GPUNodeBuilder::createKernelSync() {		void GPUNodeBuilder::createKernelSync() {
Module *M = Builder.GetInsertBlock()->getParent()->getParent();		Module *M = Builder.GetInsertBlock()->getParent()->getParent();
auto *Sync = Intrinsic::getDeclaration(M, Intrinsic::nvvm_barrier0);
		Function *Sync;

		switch (Arch) {
		case GPUArch::NVPTX64:
		Sync = Intrinsic::getDeclaration(M, Intrinsic::nvvm_barrier0);
		break;
		}

Builder.CreateCall(Sync, {});		Builder.CreateCall(Sync, {});
}		}

/// Collect llvm::Values referenced from @p Node		/// Collect llvm::Values referenced from @p Node
///		///
/// This function only applies to isl_ast_nodes that are user_nodes referring		/// This function only applies to isl_ast_nodes that are user_nodes referring
/// to a ScopStmt. All other node types are ignore.		/// to a ScopStmt. All other node types are ignore.
///		///
[... 389 lines not shown ...]	GPUNodeBuilder::createKernelFunctionDecl(ppcg_kernel *Kernel,
}		}

for (auto *V : SubtreeValues)		for (auto *V : SubtreeValues)
Args.push_back(V->getType());		Args.push_back(V->getType());

auto *FT = FunctionType::get(Builder.getVoidTy(), Args, false);		auto *FT = FunctionType::get(Builder.getVoidTy(), Args, false);
auto *FN = Function::Create(FT, Function::ExternalLinkage, Identifier,		auto *FN = Function::Create(FT, Function::ExternalLinkage, Identifier,
GPUModule.get());		GPUModule.get());

		switch (Arch) {
		case GPUArch::NVPTX64:
FN->setCallingConv(CallingConv::PTX_Kernel);		FN->setCallingConv(CallingConv::PTX_Kernel);
		break;
		}

auto Arg = FN->arg_begin();		auto Arg = FN->arg_begin();
for (long i = 0; i < Kernel->n_array; i++) {		for (long i = 0; i < Kernel->n_array; i++) {
if (!ppcg_kernel_requires_array_argument(Kernel, i))		if (!ppcg_kernel_requires_array_argument(Kernel, i))
continue;		continue;

Arg->setName(Kernel->array[i].array->name);		Arg->setName(Kernel->array[i].array->name);

[... 44 lines not shown ...]	for (auto *V : SubtreeValues) {
ValueMap[V] = &*Arg;		ValueMap[V] = &*Arg;
Arg++;		Arg++;
}		}

return FN;		return FN;
}		}

void GPUNodeBuilder::insertKernelIntrinsics(ppcg_kernel *Kernel) {		void GPUNodeBuilder::insertKernelIntrinsics(ppcg_kernel *Kernel) {
Intrinsic::ID IntrinsicsBID[] = {Intrinsic::nvvm_read_ptx_sreg_ctaid_x,		Intrinsic::ID IntrinsicsBID[2];
Intrinsic::nvvm_read_ptx_sreg_ctaid_y};		Intrinsic::ID IntrinsicsTID[3];

Intrinsic::ID IntrinsicsTID[] = {Intrinsic::nvvm_read_ptx_sreg_tid_x,		switch (Arch) {
Intrinsic::nvvm_read_ptx_sreg_tid_y,		case GPUArch::NVPTX64:
Intrinsic::nvvm_read_ptx_sreg_tid_z};		IntrinsicsBID[0] = Intrinsic::nvvm_read_ptx_sreg_ctaid_x;
		IntrinsicsBID[1] = Intrinsic::nvvm_read_ptx_sreg_ctaid_y;

		IntrinsicsTID[0] = Intrinsic::nvvm_read_ptx_sreg_tid_x;
		IntrinsicsTID[1] = Intrinsic::nvvm_read_ptx_sreg_tid_y;
		IntrinsicsTID[2] = Intrinsic::nvvm_read_ptx_sreg_tid_z;
		break;
		}

auto addId = [this](__isl_take isl_id *Id, Intrinsic::ID Intr) mutable {		auto addId = [this](__isl_take isl_id *Id, Intrinsic::ID Intr) mutable {
std::string Name = isl_id_get_name(Id);		std::string Name = isl_id_get_name(Id);
Module *M = Builder.GetInsertBlock()->getParent()->getParent();		Module *M = Builder.GetInsertBlock()->getParent()->getParent();
Function *IntrinsicFn = Intrinsic::getDeclaration(M, Intr);		Function *IntrinsicFn = Intrinsic::getDeclaration(M, Intr);
Value *Val = Builder.CreateCall(IntrinsicFn, {});		Value *Val = Builder.CreateCall(IntrinsicFn, {});
Val = Builder.CreateIntCast(Val, Builder.getInt64Ty(), false, Name);		Val = Builder.CreateIntCast(Val, Builder.getInt64Ty(), false, Name);
IDToValue[Id] = Val;		IDToValue[Id] = Val;
[... 132 lines not shown ...]	for (int i = 0; i < Kernel->n_var; ++i) {
LocalArrays.push_back(Allocation);		LocalArrays.push_back(Allocation);
KernelIds.push_back(Id);		KernelIds.push_back(Id);
IDToSAI[Id] = SAI;		IDToSAI[Id] = SAI;
}		}
}		}

void GPUNodeBuilder::createKernelFunction(ppcg_kernel *Kernel,		void GPUNodeBuilder::createKernelFunction(ppcg_kernel *Kernel,
SetVector<Value *> &SubtreeValues) {		SetVector<Value *> &SubtreeValues) {

std::string Identifier = "kernel_" + std::to_string(Kernel->id);		std::string Identifier = "kernel_" + std::to_string(Kernel->id);
GPUModule.reset(new Module(Identifier, Builder.getContext()));		GPUModule.reset(new Module(Identifier, Builder.getContext()));

		switch (Arch) {
		case GPUArch::NVPTX64:
		if (Runtime == GPURuntime::CUDA)
		Meinersbur: Is there some vendor-neutral triple?
		PhilippSchaad (Author): Do you mean like `nvptx64-nvcl` / `nvptx64-cuda`?
		Meinersbur: I hoped that there might be some kind of triple that works for OpenCL in general, not only for nvidia (`nvptx`, `nvcl`). If the generated program only works for devices that support cuda anyway, I don't see where the benefit of such a backend is. If there is indeed no backend that also works on non-nvidia devices, should we call the runtime accordingly, e.g. "nvcl" then?
		etherzhhb: for opencl, it can be "spir-unknown-unknown" or "spir64-unknown-unknown", but that may not work :)
		PhilippSchaad (Author): Looking into it. The next goal would be to add the AMDGPU backend to generate AMD ISA, which would then again utilize the same OpenCL Runtime implemented here. (I realize there will have to be some naming changes to make that clear in the `GPUJIT`, but as you pointed out, I have a naming-mess to fix there anyway.)
		grosser: Making OpenCL work for CUDA is just the first step. I expect that when adding AMDGPU support, we will use different triples here depending on which vendor to target. AMD will have a specific one, CUDA will have a specific one, and for Intel we likely use the generic SPIR-V comment. I assume this could then also work for Xilinx.
		Meinersbur: At compile time, we don't know which hardware it will run on, so we cannot specify a triple here. Unless you think of a runtime dispatch system, then you need to generate all kernels at once. In that case, I still would like to select a single target only for when I know I will run only on that hardware and to keep the executable small.
		PhilippSchaad (Author): I thought the goal was to let the user compile for a specific target, i.e. providing something like -polly-gpu-arch=amd/nvidia/intel, and then choosing the correct target triple according to said selection. Meaning for example -polly-gpu-arch=amd would utilize the AMDGPU backend triple and feed that into the OpenCL runtime. Am I misunderstanding something?
		Meinersbur: I think we were miscommunicating. The -polly-gpu-arch switch is new to me and doesn't appear in this patch. I assumed a fat executable when you mentioned an AMD backend. OpenCL claims to be hardware-independent with the platform's driver translating OpenCL-C or SPIR(-V) to its proprietary format. In NVidia's terminology, CUDA is a platform of which CUDA C++, their OpenCL implementation, cudart (CUDA runtime) etc. are part of. We are still vendor-locked to CUDA, since it only works with CUDA's OpenCL runtime library. -polly-gpu-runtime=opencl therefore is misleading (at least it was to me), it is no alternative to CUDA. It might resolve if it is indeed just the runtime library GPUJIT is linked to. If so, could you make it more clear? I suggest the following switches:
		-polly-target=cpu/gpu
		if -polly-target=gpu then -polly-gpu-arch=nvptx64/hsa/spir/spir-v/opencl-c (with -polly-gpu-arch=nvptx64 the only one implemented so far)
		if -polly-gpu-arch=nvptx64 then there is a choice between -polly-cuda-runtime=libcudart/libopencl
		PhilippSchaad (Author): This change should address exactly this. The framework is now set to introduce new architectures and utilize e.g. the AMDGPU backend instead of NVPTX etc.
GPUModule->setTargetTriple(Triple::normalize("nvptx64-nvidia-cuda"));		GPUModule->setTargetTriple(Triple::normalize("nvptx64-nvidia-cuda"));
		else if (Runtime == GPURuntime::OpenCL)
		GPUModule->setTargetTriple(Triple::normalize("nvptx64-nvidia-nvcl"));
GPUModule->setDataLayout(computeNVPTXDataLayout(true /* is64Bit */));		GPUModule->setDataLayout(computeNVPTXDataLayout(true /* is64Bit */));
		break;
		}

Function *FN = createKernelFunctionDecl(Kernel, SubtreeValues);		Function *FN = createKernelFunctionDecl(Kernel, SubtreeValues);

BasicBlock *PrevBlock = Builder.GetInsertBlock();		BasicBlock *PrevBlock = Builder.GetInsertBlock();
auto EntryBlock = BasicBlock::Create(Builder.getContext(), "entry", FN);		auto EntryBlock = BasicBlock::Create(Builder.getContext(), "entry", FN);

DT.addNewBlock(EntryBlock, PrevBlock);		DT.addNewBlock(EntryBlock, PrevBlock);

Builder.SetInsertPoint(EntryBlock);		Builder.SetInsertPoint(EntryBlock);
Builder.CreateRetVoid();		Builder.CreateRetVoid();
Builder.SetInsertPoint(EntryBlock, EntryBlock->begin());		Builder.SetInsertPoint(EntryBlock, EntryBlock->begin());

ScopDetection::markFunctionAsInvalid(FN);		ScopDetection::markFunctionAsInvalid(FN);

prepareKernelArguments(Kernel, FN);		prepareKernelArguments(Kernel, FN);
createKernelVariables(Kernel, FN);		createKernelVariables(Kernel, FN);
insertKernelIntrinsics(Kernel);		insertKernelIntrinsics(Kernel);
}		}

std::string GPUNodeBuilder::createKernelASM() {		std::string GPUNodeBuilder::createKernelASM() {
llvm::Triple GPUTriple(Triple::normalize("nvptx64-nvidia-cuda"));		llvm::Triple GPUTriple;

		switch (Arch) {
		case GPUArch::NVPTX64:
		switch (Runtime) {
		case GPURuntime::CUDA:
		GPUTriple = llvm::Triple(Triple::normalize("nvptx64-nvidia-cuda"));
		break;
		singam-sanjay: Does `nvptx64-nvidia-nvcl` mean OpenCL code meant to be run on NVIDIA GPUs?
		PhilippSchaad (Author): Yes, exactly. It generates a slightly different flavor of PTX, which can be used by OpenCL to generate a kernel from the PTX binary (on NVIDIA GPUs). If you were to use the standard CUDA PTX, OpenCL would complain because of wrong argument accesses.
		singam-sanjay: Okay. From what you're saying, `nvptx64-nvidia-nvcl` indicates that the backend must generate NVPTX code for a 64-bit architecture for an NVIDIA GPU controlled by an OpenCL driver. Please correct me if I'm wrong.
		PhilippSchaad (Author): That is correct.
		singam-sanjay: Thank you! That was helpful.
		Meinersbur: Could also be a switch.
		case GPURuntime::OpenCL:
		GPUTriple = llvm::Triple(Triple::normalize("nvptx64-nvidia-nvcl"));
		break;
		}
		break;
		}

std::string ErrMsg;		std::string ErrMsg;
auto GPUTarget = TargetRegistry::lookupTarget(GPUTriple.getTriple(), ErrMsg);		auto GPUTarget = TargetRegistry::lookupTarget(GPUTriple.getTriple(), ErrMsg);

if (!GPUTarget) {		if (!GPUTarget) {
errs() << ErrMsg << "\n";		errs() << ErrMsg << "\n";
return "";		return "";
}		}

TargetOptions Options;		TargetOptions Options;
Options.UnsafeFPMath = FastMath;		Options.UnsafeFPMath = FastMath;
std::unique_ptr<TargetMachine> TargetM(
GPUTarget->createTargetMachine(GPUTriple.getTriple(), CudaVersion, "",		std::string subtarget;
Options, Optional<Reloc::Model>()));
		switch (Arch) {
		case GPUArch::NVPTX64:
		subtarget = CudaVersion;
		break;
		}

		std::unique_ptr<TargetMachine> TargetM(GPUTarget->createTargetMachine(
		GPUTriple.getTriple(), subtarget, "", Options, Optional<Reloc::Model>()));

SmallString<0> ASMString;		SmallString<0> ASMString;
raw_svector_ostream ASMStream(ASMString);		raw_svector_ostream ASMStream(ASMString);
llvm::legacy::PassManager PM;		llvm::legacy::PassManager PM;

PM.add(createTargetTransformInfoWrapperPass(TargetM->getTargetIRAnalysis()));		PM.add(createTargetTransformInfoWrapperPass(TargetM->getTargetIRAnalysis()));

if (TargetM->addPassesToEmitFile(		if (TargetM->addPassesToEmitFile(
[... 34 lines not shown ...]	std::string GPUNodeBuilder::finalizeKernelFunction() {

return Assembly;		return Assembly;
}		}

namespace {		namespace {
class PPCGCodeGeneration : public ScopPass {		class PPCGCodeGeneration : public ScopPass {
public:		public:
static char ID;		static char ID;

		Meinersbur: Why a static flag?
		GPURuntime Runtime = GPURuntime::CUDA;

		GPUArch Architecture = GPUArch::NVPTX64;

/// The scop that is currently processed.		/// The scop that is currently processed.
Scop *S;		Scop *S;

LoopInfo *LI;		LoopInfo *LI;
DominatorTree *DT;		DominatorTree *DT;
ScalarEvolution *SE;		ScalarEvolution *SE;
const DataLayout *DL;		const DataLayout *DL;
RegionInfo *RI;		RegionInfo *RI;
[... 767 lines not shown ...]	void generateCode(__isl_take isl_ast_node *Root, gpu_prog *Prog) {
// branch will guard the original scop from new induction variables that		// branch will guard the original scop from new induction variables that
// the SCEVExpander may introduce while code generating the parameters and		// the SCEVExpander may introduce while code generating the parameters and
// which may introduce scalar dependences that prevent us from correctly		// which may introduce scalar dependences that prevent us from correctly
// code generating this scop.		// code generating this scop.
BasicBlock *StartBlock =		BasicBlock *StartBlock =
executeScopConditionally(S, Builder.getTrue(), DT, RI, LI);		executeScopConditionally(S, Builder.getTrue(), DT, RI, LI);

GPUNodeBuilder NodeBuilder(Builder, Annotator, DL, LI, SE, DT, *S,		GPUNodeBuilder NodeBuilder(Builder, Annotator, DL, LI, SE, DT, *S,
StartBlock, Prog);		StartBlock, Prog, Runtime, Architecture);

// TODO: Handle LICM		// TODO: Handle LICM
auto SplitBlock = StartBlock->getSinglePredecessor();		auto SplitBlock = StartBlock->getSinglePredecessor();
Builder.SetInsertPoint(SplitBlock->getTerminator());		Builder.SetInsertPoint(SplitBlock->getTerminator());
NodeBuilder.addParameters(S->getContext());		NodeBuilder.addParameters(S->getContext());

isl_ast_build *Build = isl_ast_build_alloc(S->getIslCtx());		isl_ast_build *Build = isl_ast_build_alloc(S->getIslCtx());
isl_ast_expr *Condition = IslAst::buildRunCondition(S, Build);		isl_ast_expr *Condition = IslAst::buildRunCondition(S, Build);
[... 71 lines not shown ...]	void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addPreserved<RegionInfoPass>();		AU.addPreserved<RegionInfoPass>();
AU.addPreserved<ScopInfoRegionPass>();		AU.addPreserved<ScopInfoRegionPass>();
}		}
};		};
} // namespace		} // namespace

char PPCGCodeGeneration::ID = 1;		char PPCGCodeGeneration::ID = 1;

Pass *polly::createPPCGCodeGenerationPass() { return new PPCGCodeGeneration(); }		Pass *polly::createPPCGCodeGenerationPass(GPUArch Arch, GPURuntime Runtime) {
		PPCGCodeGeneration *generator = new PPCGCodeGeneration();
		generator->Runtime = Runtime;
		generator->Architecture = Arch;
		return generator;
		}
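
		Note: the selection logic that this patch spreads over `createCallInitContext`, `createKernelFunction` and `createKernelASM` can be sketched in plain C++ without an LLVM build. The enum names and string values below are taken from the diff; the helper names `initContextName` and `kernelTriple` are hypothetical, and `assert` is only a stand-in for the `llvm_unreachable` default suggested in the review. Leaving out a `default:` label inside the switches lets the compiler warn when a new enumerator is added.

```cpp
#include <cassert>
#include <string>

// Enum names mirror the patch; the helpers below are illustrative only.
enum class GPURuntime { CUDA, OpenCL };
enum class GPUArch { NVPTX64 };

// Hypothetical helper: name of the runtime-specific context initializer
// that the generated host code calls.
std::string initContextName(GPURuntime Runtime) {
  switch (Runtime) {
  case GPURuntime::CUDA:
    return "polly_initContextCUDA";
  case GPURuntime::OpenCL:
    return "polly_initContextCL";
  }
  // Only reached for a value outside the enum (stand-in for llvm_unreachable).
  assert(false && "Invalid argument for Runtime");
  return "";
}

// Hypothetical helper: target triple for the kernel module, chosen from the
// architecture first and the runtime second, as in the patch.
std::string kernelTriple(GPUArch Arch, GPURuntime Runtime) {
  switch (Arch) {
  case GPUArch::NVPTX64:
    switch (Runtime) {
    case GPURuntime::CUDA:
      return "nvptx64-nvidia-cuda";
    case GPURuntime::OpenCL:
      return "nvptx64-nvidia-nvcl";
    }
    break;
  }
  assert(false && "Invalid argument for Arch or Runtime");
  return "";
}
```

		For example, `kernelTriple(GPUArch::NVPTX64, GPURuntime::OpenCL)` yields the `nvcl` triple discussed in the inline comments, while the CUDA runtime keeps the triple that was previously hard-coded.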

		Meinersbur: Did you consider `Pass *polly::createPPCGCodeGenerationPass(int Runtime);`?
		PhilippSchaad (Author): That seems reasonable, but I get a template-conflict for the LLVM Pass-Creation template when trying to change the pass-creation-method structure. I thought it might be easier this way?
		PhilippSchaad (Author): Correction: I was looking at the wrong function of course, you mean a different one :-)
INITIALIZE_PASS_BEGIN(PPCGCodeGeneration, "polly-codegen-ppcg",		INITIALIZE_PASS_BEGIN(PPCGCodeGeneration, "polly-codegen-ppcg",
"Polly - Apply PPCG translation to SCOP", false, false)		"Polly - Apply PPCG translation to SCOP", false, false)
INITIALIZE_PASS_DEPENDENCY(DependenceInfo);		INITIALIZE_PASS_DEPENDENCY(DependenceInfo);
		Meinersbur: We should check whether the arguments are valid. Such as: `switch (Runtime) { case 1: case 2: ... default: llvm_unreachable("Invalid argument for Runtime"); }`
INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass);		INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass);
INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass);		INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass);
INITIALIZE_PASS_DEPENDENCY(RegionInfoPass);		INITIALIZE_PASS_DEPENDENCY(RegionInfoPass);
INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass);		INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass);
INITIALIZE_PASS_DEPENDENCY(ScopDetection);		INITIALIZE_PASS_DEPENDENCY(ScopDetection);
INITIALIZE_PASS_END(PPCGCodeGeneration, "polly-codegen-ppcg",		INITIALIZE_PASS_END(PPCGCodeGeneration, "polly-codegen-ppcg",
		Meinersbur: Similarly: `switch (Arch) { case 1: ... default: llvm_unreachable("Invalid argument for Arch"); }`
"Polly - Apply PPCG translation to SCOP", false, false)		"Polly - Apply PPCG translation to SCOP", false, false)

lib/Support/RegisterPasses.cpp

[... 17 lines not shown ...]
// changed, but that the flag '-polly' provided at optimization level '-O3'		// changed, but that the flag '-polly' provided at optimization level '-O3'
// enables additional polyhedral optimizations.		// enables additional polyhedral optimizations.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "polly/RegisterPasses.h"		#include "polly/RegisterPasses.h"
#include "polly/Canonicalization.h"		#include "polly/Canonicalization.h"
#include "polly/CodeGen/CodeGeneration.h"		#include "polly/CodeGen/CodeGeneration.h"
#include "polly/CodeGen/CodegenCleanup.h"		#include "polly/CodeGen/CodegenCleanup.h"
		#include "polly/CodeGen/PPCGCodeGeneration.h"
#include "polly/DeLICM.h"		#include "polly/DeLICM.h"
#include "polly/DependenceInfo.h"		#include "polly/DependenceInfo.h"
#include "polly/FlattenSchedule.h"		#include "polly/FlattenSchedule.h"
#include "polly/LinkAllPasses.h"		#include "polly/LinkAllPasses.h"
#include "polly/Options.h"		#include "polly/Options.h"
#include "polly/PolyhedralInfo.h"		#include "polly/PolyhedralInfo.h"
#include "polly/ScopDetection.h"		#include "polly/ScopDetection.h"
#include "polly/ScopInfo.h"		#include "polly/ScopInfo.h"
[... 62 lines not shown ...]	Target("polly-target", cl::desc("The hardware to target"),
cl::values(clEnumValN(TARGET_CPU, "cpu", "generate CPU code")		cl::values(clEnumValN(TARGET_CPU, "cpu", "generate CPU code")
#ifdef GPU_CODEGEN		#ifdef GPU_CODEGEN
,		,
clEnumValN(TARGET_GPU, "gpu", "generate GPU code")		clEnumValN(TARGET_GPU, "gpu", "generate GPU code")
#endif		#endif
),		),
cl::init(TARGET_CPU), cl::ZeroOrMore, cl::cat(PollyCategory));		cl::init(TARGET_CPU), cl::ZeroOrMore, cl::cat(PollyCategory));

		#ifdef GPU_CODEGEN
		static cl::opt<GPURuntime> GPURuntimeChoice(
		"polly-gpu-runtime", cl::desc("The GPU Runtime API to target"),
		cl::values(clEnumValN(GPURuntime::CUDA, "libcudart",
		"use the CUDA Runtime API"),
		clEnumValN(GPURuntime::OpenCL, "libopencl",
		"use the OpenCL Runtime API")),
		cl::init(GPURuntime::CUDA), cl::ZeroOrMore, cl::cat(PollyCategory));

		static cl::opt<GPUArch>
		GPUArchChoice("polly-gpu-arch", cl::desc("The GPU Architecture to target"),
		cl::values(clEnumValN(GPUArch::NVPTX64, "nvptx64",
		"target NVIDIA 64-bit architecture")),
		cl::init(GPUArch::NVPTX64), cl::ZeroOrMore,
		cl::cat(PollyCategory));
		#endif

VectorizerChoice polly::PollyVectorizerChoice;		VectorizerChoice polly::PollyVectorizerChoice;
static cl::opt<polly::VectorizerChoice, true> Vectorizer(		static cl::opt<polly::VectorizerChoice, true> Vectorizer(
"polly-vectorizer", cl::desc("Select the vectorization strategy"),		"polly-vectorizer", cl::desc("Select the vectorization strategy"),
cl::values(		cl::values(
clEnumValN(polly::VECTORIZER_NONE, "none", "No Vectorization"),		clEnumValN(polly::VECTORIZER_NONE, "none", "No Vectorization"),
clEnumValN(polly::VECTORIZER_POLLY, "polly",		clEnumValN(polly::VECTORIZER_POLLY, "polly",
"Polly internal vectorizer"),		"Polly internal vectorizer"),
clEnumValN(		clEnumValN(
[... 190 lines not shown ...]	case OPTIMIZER_ISL:
PM.add(polly::createIslScheduleOptimizerPass());		PM.add(polly::createIslScheduleOptimizerPass());
break;		break;
}		}
}		}

if (ExportJScop)		if (ExportJScop)
PM.add(polly::createJSONExporterPass());		PM.add(polly::createJSONExporterPass());

if (Target == TARGET_GPU) {		if (Target == TARGET_GPU) {
#ifdef GPU_CODEGEN		#ifdef GPU_CODEGEN
PM.add(polly::createPPCGCodeGenerationPass());		PM.add(
		polly::createPPCGCodeGenerationPass(GPUArchChoice, GPURuntimeChoice));
#endif		#endif
		Meinersbur: A switch instead?
		Meinersbur: Now that `createPPCGCodeGenerationPass` takes an argument, you don't need a switch anymore.
		Meinersbur: `int Arch; switch (GPUArch) { case GPU_ARCH_NVPTX64: Arch = 1; break; } int Runtime; switch (GPURuntime) { case GPU_RUNTIME_CUDA: Runtime = 1; break; case GPU_RUNTIME_OPENCL: Runtime = 2; break; } PM.add(polly::createPPCGCodeGenerationPass(Arch, Runtime));`
		With "you don't need a switch anymore", I was thinking about the likes of: `PM.add(polly::createPPCGCodeGenerationPass(1, GPURuntime + 1));` Your choice.
		It could be helpful to have `createPPCGCodeGenerationPass` accept the GPURuntime enum as arguments instead.
		In your solution, I don't see the use of `static const int` local variables. If you want identifiers that give names to the accepted arguments, declare them as `#define` or `static const int` in the header file that also declares `createPPCGCodeGenerationPass`, so these can be used in the implementation of `createPPCGCodeGenerationPass` as well.
		PhilippSchaad (Author):
		> With "you don't need a switch anymore", I was thinking about the likes of: `PM.add(polly::createPPCGCodeGenerationPass(1, GPURuntime + 1));` Your choice.
		Personally, I think the current solution is a tiny little bit more 'documenting'. Both are fine though, good call.
		> It could be helpful to have `createPPCGCodeGenerationPass` accept the GPURuntime enum as arguments instead.
		That is true, would mean having to provide a header with that GPURuntime enum instead though, right?
		> In your solution, I don't see the use of `static const int` local variables. If you want identifiers that give names to the accepted arguments, declare them as `#define` or `static const int` in the header file that also declares `createPPCGCodeGenerationPass`, so these can be used in the implementation of `createPPCGCodeGenerationPass` as well.
		Would it maybe make sense to introduce a PPCGCodeGeneration header at this point?
		Meinersbur: Please remove at least the `static` keyword. It makes sense for global constants, but not for function-local ones. The style `static const int ArgumentName = 0; func(ArgumentName);` is rather unusual in LLVM-style code (but not bad if applied consistently, which is unfortunately not the case in Polly). I've seen `func(/* ArgumentName = */ 0);` much more often. In this case I think `UseOpenCLRuntime`, `UseCUDARuntime` and `TargetNVPTX64` should really be global constants declared close to the declaration of `createPPCGCodeGenerationPass` so they can be used by every caller of that function.
		> Would it maybe make sense to introduce a PPCGCodeGeneration header at this point?
		Yes, that sounds good to me as well.
} else {		} else {
switch (CodeGeneration) {		switch (CodeGeneration) {
case CODEGEN_AST:		case CODEGEN_AST:
PM.add(polly::createIslAstInfoPass());		PM.add(polly::createIslAstInfoPass());
break;		break;
case CODEGEN_FULL:		case CODEGEN_FULL:
PM.add(polly::createCodeGenerationPass());		PM.add(polly::createCodeGenerationPass());
break;		break;
[... 125 lines not shown ...]
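
Note: the two new options map user-facing spellings onto the `GPURuntime` and `GPUArch` enums. The following is a minimal standalone sketch of that mapping without `cl::opt`; the helpers `parseGPURuntime` and `parseGPUArch` are hypothetical and not part of the patch, while the accepted strings mirror the `clEnumValN` entries of `-polly-gpu-runtime` and `-polly-gpu-arch` shown in this file.

```cpp
#include <cassert>
#include <string>

// Enum names mirror the patch; the parse helpers are illustrative only.
enum class GPURuntime { CUDA, OpenCL };
enum class GPUArch { NVPTX64 };

// Hypothetical helper: maps a -polly-gpu-runtime value to the enum.
// Returns false and leaves Out untouched for an unknown spelling.
bool parseGPURuntime(const std::string &Value, GPURuntime &Out) {
  if (Value == "libcudart") {
    Out = GPURuntime::CUDA;
    return true;
  }
  if (Value == "libopencl") {
    Out = GPURuntime::OpenCL;
    return true;
  }
  return false;
}

// Hypothetical helper: maps a -polly-gpu-arch value to the enum.
// Only nvptx64 is accepted, matching the single architecture in the patch.
bool parseGPUArch(const std::string &Value, GPUArch &Out) {
  if (Value == "nvptx64") {
    Out = GPUArch::NVPTX64;
    return true;
  }
  return false;
}
```

A caller would start from the defaults the patch sets with `cl::init` (`GPURuntime::CUDA` and `GPUArch::NVPTX64`) and overwrite them only when parsing succeeds.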

test/GPGPU/cuda-managed-memory-simple.ll

	[... 29 lines not shown ...]
	; }			; }
	;			;

	; CHECK-NOT: polly_copyFromHostToDevice			; CHECK-NOT: polly_copyFromHostToDevice
	; CHECK-NOT: polly_copyFromDeviceToHost			; CHECK-NOT: polly_copyFromDeviceToHost
	; CHECK-NOT: polly_freeDeviceMemory			; CHECK-NOT: polly_freeDeviceMemory
	; CHECK-NOT: polly_allocateMemoryForDevice			; CHECK-NOT: polly_allocateMemoryForDevice

	; CHECK: %13 = call i8* @polly_initContext()			; CHECK: %13 = call i8* @polly_initContextCUDA()
	; CHECK-NEXT: %14 = bitcast i32* %A to i8*			; CHECK-NEXT: %14 = bitcast i32* %A to i8*
	; CHECK-NEXT: %15 = getelementptr [2 x i8], [2 x i8]* %polly_launch_0_params, i64 0, i64 0			; CHECK-NEXT: %15 = getelementptr [2 x i8], [2 x i8]* %polly_launch_0_params, i64 0, i64 0
	; CHECK-NEXT: store i8* %14, i8** %polly_launch_0_param_0			; CHECK-NEXT: store i8* %14, i8** %polly_launch_0_param_0
	; CHECK-NEXT: %16 = bitcast i8** %polly_launch_0_param_0 to i8*			; CHECK-NEXT: %16 = bitcast i8** %polly_launch_0_param_0 to i8*
	; CHECK-NEXT: store i8* %16, i8** %15			; CHECK-NEXT: store i8* %16, i8** %15
	; CHECK-NEXT: %17 = bitcast i32* %R to i8*			; CHECK-NEXT: %17 = bitcast i32* %R to i8*
	; CHECK-NEXT: %18 = getelementptr [2 x i8], [2 x i8]* %polly_launch_0_params, i64 0, i64 1			; CHECK-NEXT: %18 = getelementptr [2 x i8], [2 x i8]* %polly_launch_0_params, i64 0, i64 1
	; CHECK-NEXT: store i8* %17, i8** %polly_launch_0_param_1			; CHECK-NEXT: store i8* %17, i8** %polly_launch_0_param_1
	; CHECK-NEXT: %19 = bitcast i8** %polly_launch_0_param_1 to i8*			; CHECK-NEXT: %19 = bitcast i8** %polly_launch_0_param_1 to i8*
	; CHECK-NEXT: store i8* %19, i8** %18			; CHECK-NEXT: store i8* %19, i8** %18
	; CHECK-NEXT: %20 = call i8* @polly_getKernel(i8* getelementptr inbounds ([750 x i8], [750 x i8]* @kernel_0, i32 0, i32 0), i8* getelementptr inbounds ([9 x i8], [9 x i8]* @kernel_0_name, i32 0, i32 0))			; CHECK-NEXT: %20 = call i8* @polly_getKernel(i8* getelementptr inbounds ([750 x i8], [750 x i8]* @kernel_0, i32 0, i32 0), i8* getelementptr inbounds ([9 x i8], [9 x i8]* @kernel_0_name, i32 0, i32 0))
				grosser: This change is unrelated.
				PhilippSchaad (author): It is, but it got fixed in the meantime anyway. Removing it.
	; CHECK-NEXT: call void @polly_launchKernel(i8* %20, i32 2, i32 1, i32 32, i32 1, i32 1, i8* %polly_launch_0_params_i8ptr)			; CHECK-NEXT: call void @polly_launchKernel(i8* %20, i32 2, i32 1, i32 32, i32 1, i32 1, i8* %polly_launch_0_params_i8ptr)
	; CHECK-NEXT: call void @polly_freeKernel(i8* %20)			; CHECK-NEXT: call void @polly_freeKernel(i8* %20)
	; CHECK-NEXT: call void @polly_synchronizeDevice()			; CHECK-NEXT: call void @polly_synchronizeDevice()
	; CHECK-NEXT: call void @polly_freeContext(i8* %13)			; CHECK-NEXT: call void @polly_freeContext(i8* %13)

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

	define void @copy(i32* %R, i32* %A) {			define void @copy(i32* %R, i32* %A) {
	[... 61 lines not shown ...]

test/GPGPU/size-cast.ll

	[... 23 lines not shown ...]
	; CODE: cudaCheckReturn(cudaMemcpy(MemRef_arg2, dev_MemRef_arg2, (arg) * sizeof(double), cudaMemcpyDeviceToHost));			; CODE: cudaCheckReturn(cudaMemcpy(MemRef_arg2, dev_MemRef_arg2, (arg) * sizeof(double), cudaMemcpyDeviceToHost));
	; CODE-NEXT: }			; CODE-NEXT: }

	; CODE: # kernel0			; CODE: # kernel0
	; CODE-NEXT: for (int c0 = 0; c0 <= (arg - 32 * b0 - 1) / 1048576; c0 += 1)			; CODE-NEXT: for (int c0 = 0; c0 <= (arg - 32 * b0 - 1) / 1048576; c0 += 1)
	; CODE-NEXT: if (arg >= 32 * b0 + t0 + 1048576 * c0 + 1)			; CODE-NEXT: if (arg >= 32 * b0 + t0 + 1048576 * c0 + 1)
	; CODE-NEXT: Stmt_bb6(0, 32 * b0 + t0 + 1048576 * c0);			; CODE-NEXT: Stmt_bb6(0, 32 * b0 + t0 + 1048576 * c0);

	; IR: call i8* @polly_initContext()			; IR: call i8* @polly_initContextCUDA()
	; IR-NEXT: sext i32 %arg to i64			; IR-NEXT: sext i32 %arg to i64
	; IR-NEXT: mul i64			; IR-NEXT: mul i64
	; IR-NEXT: @polly_allocateMemoryForDevice			; IR-NEXT: @polly_allocateMemoryForDevice

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-unknown-linux-gnu"			target triple = "x86_64-unknown-linux-gnu"

	define void @hoge(i32 %arg, i32 %arg1, [1000 x double]* %arg2, double* %arg3) {			define void @hoge(i32 %arg, i32 %arg1, [1000 x double]* %arg2, double* %arg3) {
	[... 25 lines not shown ...]

tools/CMakeLists.txt

	if (CUDALIB_FOUND)			if (CUDALIB_FOUND OR OpenCL_FOUND)
	add_subdirectory(GPURuntime)			add_subdirectory(GPURuntime)
	endif (CUDALIB_FOUND)			endif (CUDALIB_FOUND OR OpenCL_FOUND)

	set(LLVM_COMMON_DEPENDS ${LLVM_COMMON_DEPENDS} PARENT_SCOPE)			set(LLVM_COMMON_DEPENDS ${LLVM_COMMON_DEPENDS} PARENT_SCOPE)

tools/GPURuntime/GPUJIT.h

	[... 70 lines not shown ...]
	* polly_copyFromDeviceToHost(HostData, DevData, MemSize);			* polly_copyFromDeviceToHost(HostData, DevData, MemSize);
	* polly_freeKernel(Kernel);			* polly_freeKernel(Kernel);
	* polly_freeDeviceMemory(DevArray);			* polly_freeDeviceMemory(DevArray);
	* polly_freeContext(Context);			* polly_freeContext(Context);
	* }			* }
	*			*
	*/			*/

				typedef enum PollyGPURuntimeT {
				RUNTIME_NONE,
				RUNTIME_CUDA,
				RUNTIME_CL
				} PollyGPURuntime;

	typedef struct PollyGPUContextT PollyGPUContext;			typedef struct PollyGPUContextT PollyGPUContext;
	typedef struct PollyGPUFunctionT PollyGPUFunction;			typedef struct PollyGPUFunctionT PollyGPUFunction;
	typedef struct PollyGPUDevicePtrT PollyGPUDevicePtr;			typedef struct PollyGPUDevicePtrT PollyGPUDevicePtr;

	PollyGPUContext *polly_initContext();			typedef struct OpenCLContextT OpenCLContext;
	PollyGPUFunction *polly_getKernel(const char *PTXBuffer,			typedef struct OpenCLKernelT OpenCLKernel;
				typedef struct OpenCLDevicePtrT OpenCLDevicePtr;

				typedef struct CUDAContextT CUDAContext;
				typedef struct CUDAKernelT CUDAKernel;
				typedef struct CUDADevicePtrT CUDADevicePtr;

				PollyGPUContext *polly_initContextCUDA();
				PollyGPUContext *polly_initContextCL();
				PollyGPUFunction *polly_getKernel(const char *BinaryBuffer,
	const char *KernelName);			const char *KernelName);
	void polly_freeKernel(PollyGPUFunction *Kernel);			void polly_freeKernel(PollyGPUFunction *Kernel);
	void polly_copyFromHostToDevice(void *HostData, PollyGPUDevicePtr *DevData,			void polly_copyFromHostToDevice(void *HostData, PollyGPUDevicePtr *DevData,
	long MemSize);			long MemSize);
	void polly_copyFromDeviceToHost(PollyGPUDevicePtr *DevData, void *HostData,			void polly_copyFromDeviceToHost(PollyGPUDevicePtr *DevData, void *HostData,
	long MemSize);			long MemSize);
	void polly_synchronizeDevice();			void polly_synchronizeDevice();
	void polly_launchKernel(PollyGPUFunction *Kernel, unsigned int GridDimX,			void polly_launchKernel(PollyGPUFunction *Kernel, unsigned int GridDimX,
	unsigned int GridDimY, unsigned int BlockSizeX,			unsigned int GridDimY, unsigned int BlockSizeX,
	unsigned int BlockSizeY, unsigned int BlockSizeZ,			unsigned int BlockSizeY, unsigned int BlockSizeZ,
	void **Parameters);			void **Parameters);
	void polly_freeDeviceMemory(PollyGPUDevicePtr *Allocation);			void polly_freeDeviceMemory(PollyGPUDevicePtr *Allocation);
	void polly_freeContext(PollyGPUContext *Context);			void polly_freeContext(PollyGPUContext *Context);
	#endif /* GPUJIT_H_ */			#endif /* GPUJIT_H_ */

tools/GPURuntime/GPUJIT.c

/****************** GPUJIT.c - GPUJIT Execution Engine ********************/		/****************** GPUJIT.c - GPUJIT Execution Engine ********************/
/* */		/* */
/* The LLVM Compiler Infrastructure */		/* The LLVM Compiler Infrastructure */
/* */		/* */
/* This file is dual licensed under the MIT and the University of Illinois */		/* This file is dual licensed under the MIT and the University of Illinois */
/* Open Source License. See LICENSE.TXT for details. */		/* Open Source License. See LICENSE.TXT for details. */
/* */		/* */
/******************************************************************************/		/******************************************************************************/
/* */		/* */
/* This file implements GPUJIT, a ptx string execution engine for GPU. */		/* This file implements GPUJIT, a ptx string execution engine for GPU. */
/* */		/* */
/******************************************************************************/		/******************************************************************************/

#include "GPUJIT.h"		#include "GPUJIT.h"

		#ifdef HAS_LIBCUDART
#include <cuda.h>		#include <cuda.h>
#include <cuda_runtime.h>		#include <cuda_runtime.h>
		#endif /* HAS_LIBCUDART */

		#ifdef HAS_LIBOPENCL
		#ifdef __APPLE__
		#include <OpenCL/opencl.h>
		#else
		#include <CL/cl.h>
		#endif
		#endif /* HAS_LIBOPENCL */

#include <dlfcn.h>		#include <dlfcn.h>
#include <stdarg.h>		#include <stdarg.h>
#include <stdio.h>		#include <stdio.h>
#include <string.h>		#include <string.h>

static int DebugMode;		static int DebugMode;
static int CacheMode;		static int CacheMode;

		static PollyGPURuntime Runtime = RUNTIME_NONE;

static void debug_print(const char *format, ...) {		static void debug_print(const char *format, ...) {
if (!DebugMode)		if (!DebugMode)
return;		return;

va_list args;		va_list args;
va_start(args, format);		va_start(args, format);
vfprintf(stderr, format, args);		vfprintf(stderr, format, args);
va_end(args);		va_end(args);
}		}
#define dump_function() debug_print("-> %s\n", __func__)		#define dump_function() debug_print("-> %s\n", __func__)

/* Define Polly's GPGPU data types. */		#define KERNEL_CACHE_SIZE 10

		static void err_runtime() {
		fprintf(stderr, "Runtime not correctly initialized.\n");
		exit(-1);
		}

struct PollyGPUContextT {		struct PollyGPUContextT {
CUcontext Cuda;		void *Context;
};		};

struct PollyGPUFunctionT {		struct PollyGPUFunctionT {
		void *Kernel;
		};

		struct PollyGPUDevicePtrT {
		void *DevicePtr;
		};

		/******************************************************************************/
		/* OpenCL */
		/******************************************************************************/
		#ifdef HAS_LIBOPENCL

		struct OpenCLContextT {
		cl_context Context;
		cl_command_queue CommandQueue;
		};

		struct OpenCLKernelT {
		cl_kernel Kernel;
		cl_program Program;
		const char *BinaryString;
		};

		struct OpenCLDevicePtrT {
		cl_mem MemObj;
		};

		/* Dynamic library handles for the OpenCL runtime library. */
		static void *HandleOpenCL;

		/* Type-defines of function pointer to OpenCL Runtime API. */
		typedef cl_int clGetPlatformIDsFcnTy(cl_uint NumEntries,
		cl_platform_id *Platforms,
		cl_uint *NumPlatforms);
		static clGetPlatformIDsFcnTy *clGetPlatformIDsFcnPtr;

		typedef cl_int clGetDeviceIDsFcnTy(cl_platform_id Platform,
		cl_device_type DeviceType,
		cl_uint NumEntries, cl_device_id *Devices,
		cl_uint *NumDevices);
		static clGetDeviceIDsFcnTy *clGetDeviceIDsFcnPtr;

		typedef cl_int clGetDeviceInfoFcnTy(cl_device_id Device,
		cl_device_info ParamName,
		size_t ParamValueSize, void *ParamValue,
		size_t *ParamValueSizeRet);
		static clGetDeviceInfoFcnTy *clGetDeviceInfoFcnPtr;

		typedef cl_int clGetKernelInfoFcnTy(cl_kernel Kernel, cl_kernel_info ParamName,
		size_t ParamValueSize, void *ParamValue,
		size_t *ParamValueSizeRet);
		static clGetKernelInfoFcnTy *clGetKernelInfoFcnPtr;

		typedef cl_context clCreateContextFcnTy(
		const cl_context_properties *Properties, cl_uint NumDevices,
		const cl_device_id *Devices,
		void CL_CALLBACK *pfn_notify(const char *Errinfo, const void *PrivateInfo,
		size_t CB, void *UserData),
		void *UserData, cl_int *ErrcodeRet);
		static clCreateContextFcnTy *clCreateContextFcnPtr;

		typedef cl_command_queue
		clCreateCommandQueueFcnTy(cl_context Context, cl_device_id Device,
		cl_command_queue_properties Properties,
		cl_int *ErrcodeRet);
		static clCreateCommandQueueFcnTy *clCreateCommandQueueFcnPtr;

		typedef cl_mem clCreateBufferFcnTy(cl_context Context, cl_mem_flags Flags,
		size_t Size, void *HostPtr,
		cl_int *ErrcodeRet);
		static clCreateBufferFcnTy *clCreateBufferFcnPtr;

		typedef cl_int
		clEnqueueWriteBufferFcnTy(cl_command_queue CommandQueue, cl_mem Buffer,
		cl_bool BlockingWrite, size_t Offset, size_t Size,
		const void *Ptr, cl_uint NumEventsInWaitList,
		const cl_event *EventWaitList, cl_event *Event);
		static clEnqueueWriteBufferFcnTy *clEnqueueWriteBufferFcnPtr;

		typedef cl_program clCreateProgramWithBinaryFcnTy(
		cl_context Context, cl_uint NumDevices, const cl_device_id *DeviceList,
		const size_t *Lengths, const unsigned char **Binaries, cl_int *BinaryStatus,
		cl_int *ErrcodeRet);
		static clCreateProgramWithBinaryFcnTy *clCreateProgramWithBinaryFcnPtr;

		typedef cl_int clBuildProgramFcnTy(
		cl_program Program, cl_uint NumDevices, const cl_device_id *DeviceList,
		const char *Options,
		void(CL_CALLBACK *pfn_notify)(cl_program Program, void *UserData),
		void *UserData);
		static clBuildProgramFcnTy *clBuildProgramFcnPtr;

		typedef cl_kernel clCreateKernelFcnTy(cl_program Program,
		const char *KernelName,
		cl_int *ErrcodeRet);
		static clCreateKernelFcnTy *clCreateKernelFcnPtr;

		typedef cl_int clSetKernelArgFcnTy(cl_kernel Kernel, cl_uint ArgIndex,
		size_t ArgSize, const void *ArgValue);
		static clSetKernelArgFcnTy *clSetKernelArgFcnPtr;

		typedef cl_int clEnqueueNDRangeKernelFcnTy(
		cl_command_queue CommandQueue, cl_kernel Kernel, cl_uint WorkDim,
		const size_t *GlobalWorkOffset, const size_t *GlobalWorkSize,
		const size_t *LocalWorkSize, cl_uint NumEventsInWaitList,
		const cl_event *EventWaitList, cl_event *Event);
		static clEnqueueNDRangeKernelFcnTy *clEnqueueNDRangeKernelFcnPtr;

		typedef cl_int clEnqueueReadBufferFcnTy(cl_command_queue CommandQueue,
		cl_mem Buffer, cl_bool BlockingRead,
		size_t Offset, size_t Size, void *Ptr,
		cl_uint NumEventsInWaitList,
		const cl_event *EventWaitList,
		cl_event *Event);
		static clEnqueueReadBufferFcnTy *clEnqueueReadBufferFcnPtr;

		typedef cl_int clFlushFcnTy(cl_command_queue CommandQueue);
		static clFlushFcnTy *clFlushFcnPtr;

		typedef cl_int clFinishFcnTy(cl_command_queue CommandQueue);
		static clFinishFcnTy *clFinishFcnPtr;

		typedef cl_int clReleaseKernelFcnTy(cl_kernel Kernel);
		static clReleaseKernelFcnTy *clReleaseKernelFcnPtr;

		typedef cl_int clReleaseProgramFcnTy(cl_program Program);
		static clReleaseProgramFcnTy *clReleaseProgramFcnPtr;

		typedef cl_int clReleaseMemObjectFcnTy(cl_mem Memobject);
		static clReleaseMemObjectFcnTy *clReleaseMemObjectFcnPtr;

		typedef cl_int clReleaseCommandQueueFcnTy(cl_command_queue CommandQueue);
		static clReleaseCommandQueueFcnTy *clReleaseCommandQueueFcnPtr;

		typedef cl_int clReleaseContextFcnTy(cl_context Context);
		static clReleaseContextFcnTy *clReleaseContextFcnPtr;

		static void *getAPIHandleCL(void *Handle, const char *FuncName) {
		char *Err;
		void *FuncPtr;
		dlerror();
		FuncPtr = dlsym(Handle, FuncName);
		if ((Err = dlerror()) != 0) {
		fprintf(stderr, "Load OpenCL Runtime API failed: %s. \n", Err);
		return 0;
		}
		return FuncPtr;
		}

		static int initialDeviceAPILibrariesCL() {
		HandleOpenCL = dlopen("libOpenCL.so", RTLD_LAZY);
		if (!HandleOpenCL) {
		fprintf(stderr, "Cannot open library: %s. \n", dlerror());
		return 0;
		}
		return 1;
		}

		static int initialDeviceAPIsCL() {
		if (initialDeviceAPILibrariesCL() == 0)
		return 0;

		/* Get function pointer to OpenCL Runtime API.
		*
		* Note that compilers conforming to the ISO C standard are required to
		* generate a warning if a conversion from a void * pointer to a function
		* pointer is attempted as in the following statements. The warning
		* of this kind of cast may not be emitted by clang and new versions of gcc
		* as it is valid on POSIX 2008.
		*/
		clGetPlatformIDsFcnPtr =
		(clGetPlatformIDsFcnTy *)getAPIHandleCL(HandleOpenCL, "clGetPlatformIDs");

		clGetDeviceIDsFcnPtr =
		(clGetDeviceIDsFcnTy *)getAPIHandleCL(HandleOpenCL, "clGetDeviceIDs");

		clGetDeviceInfoFcnPtr =
		(clGetDeviceInfoFcnTy *)getAPIHandleCL(HandleOpenCL, "clGetDeviceInfo");

		clGetKernelInfoFcnPtr =
		(clGetKernelInfoFcnTy *)getAPIHandleCL(HandleOpenCL, "clGetKernelInfo");

		clCreateContextFcnPtr =
		(clCreateContextFcnTy *)getAPIHandleCL(HandleOpenCL, "clCreateContext");

		clCreateCommandQueueFcnPtr = (clCreateCommandQueueFcnTy *)getAPIHandleCL(
		HandleOpenCL, "clCreateCommandQueue");

		clCreateBufferFcnPtr =
		(clCreateBufferFcnTy *)getAPIHandleCL(HandleOpenCL, "clCreateBuffer");

		clEnqueueWriteBufferFcnPtr = (clEnqueueWriteBufferFcnTy *)getAPIHandleCL(
		HandleOpenCL, "clEnqueueWriteBuffer");

		clCreateProgramWithBinaryFcnPtr =
		(clCreateProgramWithBinaryFcnTy *)getAPIHandleCL(
		HandleOpenCL, "clCreateProgramWithBinary");

		clBuildProgramFcnPtr =
		(clBuildProgramFcnTy *)getAPIHandleCL(HandleOpenCL, "clBuildProgram");

		clCreateKernelFcnPtr =
		(clCreateKernelFcnTy *)getAPIHandleCL(HandleOpenCL, "clCreateKernel");

		clSetKernelArgFcnPtr =
		(clSetKernelArgFcnTy *)getAPIHandleCL(HandleOpenCL, "clSetKernelArg");

		clEnqueueNDRangeKernelFcnPtr = (clEnqueueNDRangeKernelFcnTy *)getAPIHandleCL(
		HandleOpenCL, "clEnqueueNDRangeKernel");

		clEnqueueReadBufferFcnPtr = (clEnqueueReadBufferFcnTy *)getAPIHandleCL(
		HandleOpenCL, "clEnqueueReadBuffer");

		clFlushFcnPtr = (clFlushFcnTy *)getAPIHandleCL(HandleOpenCL, "clFlush");

		clFinishFcnPtr = (clFinishFcnTy *)getAPIHandleCL(HandleOpenCL, "clFinish");

		clReleaseKernelFcnPtr =
		(clReleaseKernelFcnTy *)getAPIHandleCL(HandleOpenCL, "clReleaseKernel");

		clReleaseProgramFcnPtr =
		(clReleaseProgramFcnTy *)getAPIHandleCL(HandleOpenCL, "clReleaseProgram");

		clReleaseMemObjectFcnPtr = (clReleaseMemObjectFcnTy *)getAPIHandleCL(
		HandleOpenCL, "clReleaseMemObject");

		clReleaseCommandQueueFcnPtr = (clReleaseCommandQueueFcnTy *)getAPIHandleCL(
		HandleOpenCL, "clReleaseCommandQueue");

		clReleaseContextFcnPtr =
		(clReleaseContextFcnTy *)getAPIHandleCL(HandleOpenCL, "clReleaseContext");

		return 1;
		}

		/* Context and Device. */
		static PollyGPUContext *GlobalContext = NULL;
		static cl_device_id GlobalDeviceID = NULL;

		/* Fd-Decl: Print out OpenCL Error codes to human readable strings. */
		static void printOpenCLError(int Error);

		static void checkOpenCLError(int Ret, const char *format, ...) {
		if (Ret == CL_SUCCESS)
		return;

		printOpenCLError(Ret);
		va_list args;
		va_start(args, format);
		vfprintf(stderr, format, args);
		va_end(args);
		exit(-1);
		}

		static PollyGPUContext *initContextCL() {
		Meinersbur: Consistent variable name style? What style do you intend to use in this file?
		dump_function();

		PollyGPUContext *Context;

		cl_platform_id PlatformID = NULL;
		cl_device_id DeviceID = NULL;
		cl_uint NumDevicesRet;
		cl_int Ret;

		char DeviceRevision[256];
		char DeviceName[256];
		size_t DeviceRevisionRetSize, DeviceNameRetSize;

		static __thread PollyGPUContext *CurrentContext = NULL;

		if (CurrentContext)
		return CurrentContext;

		/* Get API handles. */
		if (initialDeviceAPIsCL() == 0) {
		fprintf(stderr, "Getting the \"handle\" for the OpenCL Runtime failed.\n");
		exit(-1);
		}

		/* Get number of devices that support OpenCL. */
		static const int NumberOfPlatforms = 1;
		Ret = clGetPlatformIDsFcnPtr(NumberOfPlatforms, &PlatformID, NULL);
		checkOpenCLError(Ret, "Failed to get platform IDs.\n");
		// TODO: Extend to CL_DEVICE_TYPE_ALL?
		static const int NumberOfDevices = 1;
		Ret = clGetDeviceIDsFcnPtr(PlatformID, CL_DEVICE_TYPE_GPU, NumberOfDevices,
		&DeviceID, &NumDevicesRet);
		checkOpenCLError(Ret, "Failed to get device IDs.\n");

		GlobalDeviceID = DeviceID;
		if (NumDevicesRet == 0) {
		fprintf(stderr, "There is no device supporting OpenCL.\n");
		Meinersbur: Replace the magic number 256 by `sizeof(DeviceRevision)`?
		exit(-1);
		}

		/* Get device revision. */
		Ret =
		clGetDeviceInfoFcnPtr(DeviceID, CL_DEVICE_VERSION, sizeof(DeviceRevision),
		DeviceRevision, &DeviceRevisionRetSize);
		checkOpenCLError(Ret, "Failed to fetch device revision.\n");

		/* Get device name. */
		Ret = clGetDeviceInfoFcnPtr(DeviceID, CL_DEVICE_NAME, sizeof(DeviceName),
		DeviceName, &DeviceNameRetSize);
		checkOpenCLError(Ret, "Failed to fetch device name.\n");

		debug_print("> Running on GPU device %d : %s.\n", DeviceID, DeviceName);

		/* Create context on the device. */
		Context = (PollyGPUContext *)malloc(sizeof(PollyGPUContext));
		if (Context == 0) {
		fprintf(stderr, "Allocate memory for Polly GPU context failed.\n");
		exit(-1);
		}
		Context->Context = (OpenCLContext *)malloc(sizeof(OpenCLContext));
		if (Context->Context == 0) {
		fprintf(stderr, "Allocate memory for Polly OpenCL context failed.\n");
		exit(-1);
		}
		((OpenCLContext *)Context->Context)->Context =
		clCreateContextFcnPtr(NULL, NumDevicesRet, &DeviceID, NULL, NULL, &Ret);
		checkOpenCLError(Ret, "Failed to create context.\n");

		static const int ExtraProperties = 0;
		((OpenCLContext *)Context->Context)->CommandQueue =
		clCreateCommandQueueFcnPtr(((OpenCLContext *)Context->Context)->Context,
		DeviceID, ExtraProperties, &Ret);
		checkOpenCLError(Ret, "Failed to create command queue.\n");

		if (CacheMode)
		CurrentContext = Context;

		GlobalContext = Context;
		return Context;
		}

		static void freeKernelCL(PollyGPUFunction *Kernel) {
		dump_function();

		if (CacheMode)
		Meinersbur: Did you consider introducing a new function for this sequence of code? It appears quite often.
		return;

		if (!GlobalContext) {
		fprintf(stderr, "GPGPU-code generation not correctly initialized.\n");
		exit(-1);
		}

		cl_int Ret;
		Ret = clFlushFcnPtr(((OpenCLContext *)GlobalContext->Context)->CommandQueue);
		checkOpenCLError(Ret, "Failed to flush command queue.\n");
		Ret = clFinishFcnPtr(((OpenCLContext *)GlobalContext->Context)->CommandQueue);
		checkOpenCLError(Ret, "Failed to finish command queue.\n");

		if (((OpenCLKernel *)Kernel->Kernel)->Kernel) {
		cl_int Ret =
		clReleaseKernelFcnPtr(((OpenCLKernel *)Kernel->Kernel)->Kernel);
		checkOpenCLError(Ret, "Failed to release kernel.\n");
		}

		if (((OpenCLKernel *)Kernel->Kernel)->Program) {
		cl_int Ret =
		clReleaseProgramFcnPtr(((OpenCLKernel *)Kernel->Kernel)->Program);
		checkOpenCLError(Ret, "Failed to release program.\n");
		}

		if (Kernel->Kernel)
		free((OpenCLKernel *)Kernel->Kernel);

		if (Kernel)
		free(Kernel);
		}

		static PollyGPUFunction *getKernelCL(const char *BinaryBuffer,
		const char *KernelName) {
		dump_function();

		if (!GlobalContext) {
		fprintf(stderr, "GPGPU-code generation not correctly initialized.\n");
		exit(-1);
		}

		static __thread PollyGPUFunction *KernelCache[KERNEL_CACHE_SIZE];
		static __thread int NextCacheItem = 0;

		for (long i = 0; i < KERNEL_CACHE_SIZE; i++) {
		// We exploit here the property that all Polly-ACC kernels are allocated
		// as global constants, hence a pointer comparison is sufficient to
		// determine equality.
		if (KernelCache[i] &&
		((OpenCLKernel *)KernelCache[i]->Kernel)->BinaryString ==
		BinaryBuffer) {
		debug_print(" -> using cached kernel\n");
		return KernelCache[i];
		}
		}

		PollyGPUFunction *Function = malloc(sizeof(PollyGPUFunction));
		if (Function == 0) {
		fprintf(stderr, "Allocate memory for Polly GPU function failed.\n");
		exit(-1);
		}
		Function->Kernel = (OpenCLKernel *)malloc(sizeof(OpenCLKernel));
		if (Function->Kernel == 0) {
		fprintf(stderr, "Allocate memory for Polly OpenCL kernel failed.\n");
		exit(-1);
		}

		if (!GlobalDeviceID) {
		fprintf(stderr, "GPGPU-code generation not initialized correctly.\n");
		exit(-1);
		}

		cl_int Ret;
		size_t BinarySize = strlen(BinaryBuffer);
		((OpenCLKernel *)Function->Kernel)->Program = clCreateProgramWithBinaryFcnPtr(
		((OpenCLContext *)GlobalContext->Context)->Context, 1, &GlobalDeviceID,
		(const size_t *)&BinarySize, (const unsigned char **)&BinaryBuffer, NULL,
		&Ret);
		checkOpenCLError(Ret, "Failed to create program from binary.\n");

		Ret = clBuildProgramFcnPtr(((OpenCLKernel *)Function->Kernel)->Program, 1,
		&GlobalDeviceID, NULL, NULL, NULL);
		checkOpenCLError(Ret, "Failed to build program.\n");

		((OpenCLKernel *)Function->Kernel)->Kernel = clCreateKernelFcnPtr(
		((OpenCLKernel *)Function->Kernel)->Program, KernelName, &Ret);
		checkOpenCLError(Ret, "Failed to create kernel.\n");

		((OpenCLKernel *)Function->Kernel)->BinaryString = BinaryBuffer;

		if (CacheMode) {
		if (KernelCache[NextCacheItem])
		freeKernelCL(KernelCache[NextCacheItem]);

		KernelCache[NextCacheItem] = Function;

		NextCacheItem = (NextCacheItem + 1) % KERNEL_CACHE_SIZE;
		}

		return Function;
		}

		static void copyFromHostToDeviceCL(void *HostData, PollyGPUDevicePtr *DevData,
		long MemSize) {
		dump_function();

		if (!GlobalContext) {
		fprintf(stderr, "GPGPU-code generation not correctly initialized.\n");
		exit(-1);
		}

		cl_int Ret;
		Ret = clEnqueueWriteBufferFcnPtr(
		((OpenCLContext *)GlobalContext->Context)->CommandQueue,
		((OpenCLDevicePtr *)DevData->DevicePtr)->MemObj, CL_TRUE, 0, MemSize,
		HostData, 0, NULL, NULL);
		checkOpenCLError(Ret, "Copying data from host memory to device failed.\n");
		}

		static void copyFromDeviceToHostCL(PollyGPUDevicePtr *DevData, void *HostData,
		long MemSize) {
		dump_function();

		if (!GlobalContext) {
		fprintf(stderr, "GPGPU-code generation not correctly initialized.\n");
		exit(-1);
		}

		cl_int Ret;
		Ret = clEnqueueReadBufferFcnPtr(
		((OpenCLContext *)GlobalContext->Context)->CommandQueue,
		((OpenCLDevicePtr *)DevData->DevicePtr)->MemObj, CL_TRUE, 0, MemSize,
		HostData, 0, NULL, NULL);
		checkOpenCLError(Ret, "Copying results from device to host memory failed.\n");
		}

		static void launchKernelCL(PollyGPUFunction *Kernel, unsigned int GridDimX,
		unsigned int GridDimY, unsigned int BlockDimX,
		unsigned int BlockDimY, unsigned int BlockDimZ,
		void **Parameters) {
		dump_function();

		cl_int Ret;
		cl_uint NumArgs;

		if (!GlobalContext) {
		fprintf(stderr, "GPGPU-code generation not correctly initialized.\n");
		exit(-1);
		}

		OpenCLKernel *CLKernel = (OpenCLKernel *)Kernel->Kernel;
		Ret = clGetKernelInfoFcnPtr(CLKernel->Kernel, CL_KERNEL_NUM_ARGS,
		sizeof(cl_uint), &NumArgs, NULL);
		checkOpenCLError(Ret, "Failed to get number of kernel arguments.\n");

		// TODO: Pass the size of the kernel arguments in to launchKernelCL, along
		// with the arguments themselves. This is a dirty workaround that can be
		// broken.
		for (cl_uint i = 0; i < NumArgs; i++) {
		Ret = clSetKernelArgFcnPtr(CLKernel->Kernel, i, 8, (void *)Parameters[i]);
		if (Ret == CL_INVALID_ARG_SIZE) {
		Ret = clSetKernelArgFcnPtr(CLKernel->Kernel, i, 4, (void *)Parameters[i]);
		if (Ret == CL_INVALID_ARG_SIZE) {
		Ret =
		clSetKernelArgFcnPtr(CLKernel->Kernel, i, 2, (void *)Parameters[i]);
		if (Ret == CL_INVALID_ARG_SIZE) {
		Ret = clSetKernelArgFcnPtr(CLKernel->Kernel, i, 1,
		(void *)Parameters[i]);
		checkOpenCLError(Ret, "Failed to set Kernel argument %d.\n", i);
		}
		}
		}
		if (Ret != CL_SUCCESS && Ret != CL_INVALID_ARG_SIZE) {
		fprintf(stderr, "Failed to set Kernel argument.\n");
		printOpenCLError(Ret);
		exit(-1);
		}
		}

		unsigned int GridDimZ = 1;
		size_t GlobalWorkSize[3] = {BlockDimX * GridDimX, BlockDimY * GridDimY,
		BlockDimZ * GridDimZ};
		size_t LocalWorkSize[3] = {BlockDimX, BlockDimY, BlockDimZ};

		static const int WorkDim = 3;
		OpenCLContext *CLContext = (OpenCLContext *)GlobalContext->Context;
		Ret = clEnqueueNDRangeKernelFcnPtr(CLContext->CommandQueue, CLKernel->Kernel,
		WorkDim, NULL, GlobalWorkSize,
		LocalWorkSize, 0, NULL, NULL);
		checkOpenCLError(Ret, "Launching OpenCL kernel failed.\n");
		}
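
The TODO in `launchKernelCL` above notes that probing sizes 8, 4, 2, 1 is a workaround and that the argument sizes should be passed in. A rough sketch of that direction follows, assuming a hypothetical extra `ParameterSizes` array supplied by the caller; `SetArgFn` stands in for `clSetKernelArgFcnPtr`, and `recordArg` is a mock used only to exercise the loop, not part of any OpenCL API.

```c
#include <stddef.h>

/* Hypothetical stand-in for the clSetKernelArg entry point: takes the
 * argument index, its size in bytes, and a pointer to its value. */
typedef int SetArgFn(unsigned Index, size_t Size, const void *Value);

/* Set NumArgs kernel arguments using caller-provided sizes instead of
 * guessing. Returns 0 on success or the first nonzero status. */
static int setKernelArgs(SetArgFn *SetArg, unsigned NumArgs,
                         void **Parameters, const size_t *ParameterSizes) {
  for (unsigned i = 0; i < NumArgs; i++) {
    int Ret = SetArg(i, ParameterSizes[i], Parameters[i]);
    if (Ret != 0)
      return Ret;
  }
  return 0;
}

/* Mock setter for this sketch: records the size it was given. */
static size_t RecordedSizes[8];
static int recordArg(unsigned Index, size_t Size, const void *Value) {
  if (Index >= 8 || Size == 0 || Value == NULL)
    return -1;
  RecordedSizes[Index] = Size;
  return 0;
}
```

With explicit sizes, a 12-byte struct argument is handled the same way as an 8-byte pointer, which the size-probing loop cannot do.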

		static void freeDeviceMemoryCL(PollyGPUDevicePtr *Allocation) {
		dump_function();

		OpenCLDevicePtr *DevPtr = (OpenCLDevicePtr *)Allocation->DevicePtr;
		cl_int Ret = clReleaseMemObjectFcnPtr((cl_mem)DevPtr->MemObj);
		checkOpenCLError(Ret, "Failed to free device memory.\n");

		free(DevPtr);
		free(Allocation);
		}

		static PollyGPUDevicePtr *allocateMemoryForDeviceCL(long MemSize) {
		dump_function();

		Meinersbur: Thanks for the introduction of `checkOpenCLError`. You could also introduce one for these two lines, for instance: if (!GlobalContext) handleError("GPGPU-code generation not correctly initialized.\n"); `handleError` could also be called by `checkOpenCLError`. It helps centralise the error handling, so that if we change some detail about it (e.g. the return code on exit, or some cleanup code), there is a single function for that.
		if (!GlobalContext) {
		fprintf(stderr, "GPGPU-code generation not correctly initialized.\n");
		exit(-1);
		}

		PollyGPUDevicePtr *DevData = malloc(sizeof(PollyGPUDevicePtr));
		if (DevData == 0) {
		fprintf(stderr, "Allocate memory for GPU device memory pointer failed.\n");
		exit(-1);
		}
		DevData->DevicePtr = (OpenCLDevicePtr *)malloc(sizeof(OpenCLDevicePtr));
		Meinersbur: Trying each argument size after the other and hoping one matches is not good. The caller must know the argument sizes. You probably have to pass the sizes in another argument to `launchKernelCL` that contains those sizes for each argument, generated by Polly. Without this, the code will fail if you pass a struct (or vector) of size other than 8, 4, 2, or 1.
		PhilippSchaad (author): Yes, this is still a priority issue. It will have to be resolved at some point. This is basically a temporary way around some (probably) major argument handling changes in PPCG etc.
		Meinersbur: Also note that I am not sure that OpenCL ICDs are required to check for a correct `CL_INVALID_ARG_SIZE`. An ICD might just trust the caller, or be a badly written one.
		if (DevData->DevicePtr == 0) {
		fprintf(stderr, "Allocate memory for GPU device memory pointer failed.\n");
		exit(-1);
		}

		cl_int Ret;
		((OpenCLDevicePtr *)DevData->DevicePtr)->MemObj =
		clCreateBufferFcnPtr(((OpenCLContext *)GlobalContext->Context)->Context,
		CL_MEM_READ_WRITE, MemSize, NULL, &Ret);
		checkOpenCLError(Ret,
		"Allocate memory for GPU device memory pointer failed.\n");

		return DevData;
		}
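
Meinersbur's `handleError` suggestion in the function above could look roughly like the sketch below. Only the name `handleError` comes from the review; `checkStatus` and `requireContext` are hypothetical helpers, and a status of 0 stands in for `CL_SUCCESS`.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Single place that decides how a fatal runtime error is reported and
 * which exit code is used. */
static void handleError(const char *Format, ...) {
  va_list Args;
  va_start(Args, Format);
  vfprintf(stderr, Format, Args);
  va_end(Args);
  exit(-1);
}

/* Hypothetical status check built on handleError; 0 means success. */
static void checkStatus(int Ret, const char *Message) {
  if (Ret != 0)
    handleError("%s (status %d)\n", Message, Ret);
}

/* Hypothetical guard replacing the repeated "if (!GlobalContext)" blocks. */
static void requireContext(const void *Context) {
  if (!Context)
    handleError("GPGPU-code generation not correctly initialized.\n");
}
```

Each runtime function would then start with `requireContext(GlobalContext);`, and any later change to the exit code or cleanup touches only `handleError`.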

		static void *getDevicePtrCL(PollyGPUDevicePtr *Allocation) {
		dump_function();

		OpenCLDevicePtr *DevPtr = (OpenCLDevicePtr *)Allocation->DevicePtr;
		return (void *)DevPtr->MemObj;
		}

		static void synchronizeDeviceCL() {
		dump_function();

		if (!GlobalContext) {
		fprintf(stderr, "GPGPU-code generation not correctly initialized.\n");
		exit(-1);
		}

		if (clFinishFcnPtr(((OpenCLContext *)GlobalContext->Context)->CommandQueue) !=
		CL_SUCCESS) {
		fprintf(stderr, "Synchronizing device and host memory failed.\n");
		exit(-1);
		}
		}

		static void freeContextCL(PollyGPUContext *Context) {
		dump_function();

		cl_int Ret;

		GlobalContext = NULL;

		OpenCLContext *Ctx = (OpenCLContext *)Context->Context;
		if (Ctx->CommandQueue) {
		Ret = clReleaseCommandQueueFcnPtr(Ctx->CommandQueue);
		checkOpenCLError(Ret, "Could not release command queue.\n");
		}

		if (Ctx->Context) {
		Ret = clReleaseContextFcnPtr(Ctx->Context);
		checkOpenCLError(Ret, "Could not release context.\n");
		}

		free(Ctx);
		free(Context);
		}
		Meinersbur: Shouldn't these print to `stderr`?

		static void printOpenCLError(int Error) {

		switch (Error) {
		case CL_SUCCESS:
		// Success, don't print an error.
		break;

		// JIT/Runtime errors.
		case CL_DEVICE_NOT_FOUND:
		fprintf(stderr, "Device not found.\n");
		break;
		case CL_DEVICE_NOT_AVAILABLE:
		fprintf(stderr, "Device not available.\n");
		break;
		case CL_COMPILER_NOT_AVAILABLE:
		fprintf(stderr, "Compiler not available.\n");
		break;
		case CL_MEM_OBJECT_ALLOCATION_FAILURE:
		fprintf(stderr, "Mem object allocation failure.\n");
		break;
		case CL_OUT_OF_RESOURCES:
		fprintf(stderr, "Out of resources.\n");
		break;
		case CL_OUT_OF_HOST_MEMORY:
		fprintf(stderr, "Out of host memory.\n");
		break;
		case CL_PROFILING_INFO_NOT_AVAILABLE:
		fprintf(stderr, "Profiling info not available.\n");
		break;
		case CL_MEM_COPY_OVERLAP:
		fprintf(stderr, "Mem copy overlap.\n");
		break;
		case CL_IMAGE_FORMAT_MISMATCH:
		fprintf(stderr, "Image format mismatch.\n");
		break;
		case CL_IMAGE_FORMAT_NOT_SUPPORTED:
		fprintf(stderr, "Image format not supported.\n");
		break;
		case CL_BUILD_PROGRAM_FAILURE:
		fprintf(stderr, "Build program failure.\n");
		break;
		case CL_MAP_FAILURE:
		fprintf(stderr, "Map failure.\n");
		break;
		case CL_MISALIGNED_SUB_BUFFER_OFFSET:
		fprintf(stderr, "Misaligned sub buffer offset.\n");
		break;
		case CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST:
		fprintf(stderr, "Exec status error for events in wait list.\n");
		break;
		case CL_COMPILE_PROGRAM_FAILURE:
		fprintf(stderr, "Compile program failure.\n");
		break;
		case CL_LINKER_NOT_AVAILABLE:
		fprintf(stderr, "Linker not available.\n");
		break;
		case CL_LINK_PROGRAM_FAILURE:
		fprintf(stderr, "Link program failure.\n");
		break;
		case CL_DEVICE_PARTITION_FAILED:
		fprintf(stderr, "Device partition failed.\n");
		break;
		case CL_KERNEL_ARG_INFO_NOT_AVAILABLE:
		fprintf(stderr, "Kernel arg info not available.\n");
		break;

		// Compiler errors.
		case CL_INVALID_VALUE:
		fprintf(stderr, "Invalid value.\n");
		break;
		case CL_INVALID_DEVICE_TYPE:
		fprintf(stderr, "Invalid device type.\n");
		break;
		Meinersbur (inline comment, done): The function name does not follow the naming of other functions in this file. In C it is common to have the public API functions prefixed with the library name (here: "polly") and everything else static. Don't choose the prefix of another library (here: "cl_"). This avoids the symbol conflicts that arise when multiple libraries happen to give a function the same name.
		case CL_INVALID_PLATFORM:
		fprintf(stderr, "Invalid platform.\n");
		break;
		case CL_INVALID_DEVICE:
		fprintf(stderr, "Invalid device.\n");
		break;
		case CL_INVALID_CONTEXT:
		fprintf(stderr, "Invalid context.\n");
		break;
		case CL_INVALID_QUEUE_PROPERTIES:
		fprintf(stderr, "Invalid queue properties.\n");
		break;
		case CL_INVALID_COMMAND_QUEUE:
		fprintf(stderr, "Invalid command queue.\n");
		break;
		case CL_INVALID_HOST_PTR:
		fprintf(stderr, "Invalid host pointer.\n");
		break;
		case CL_INVALID_MEM_OBJECT:
		fprintf(stderr, "Invalid memory object.\n");
		break;
		case CL_INVALID_IMAGE_FORMAT_DESCRIPTOR:
		fprintf(stderr, "Invalid image format descriptor.\n");
		break;
		case CL_INVALID_IMAGE_SIZE:
		fprintf(stderr, "Invalid image size.\n");
		break;
		case CL_INVALID_SAMPLER:
		fprintf(stderr, "Invalid sampler.\n");
		break;
		case CL_INVALID_BINARY:
		fprintf(stderr, "Invalid binary.\n");
		break;
		case CL_INVALID_BUILD_OPTIONS:
		fprintf(stderr, "Invalid build options.\n");
		break;
		case CL_INVALID_PROGRAM:
		fprintf(stderr, "Invalid program.\n");
		break;
		case CL_INVALID_PROGRAM_EXECUTABLE:
		fprintf(stderr, "Invalid program executable.\n");
		break;
		case CL_INVALID_KERNEL_NAME:
		fprintf(stderr, "Invalid kernel name.\n");
		break;
		case CL_INVALID_KERNEL_DEFINITION:
		fprintf(stderr, "Invalid kernel definition.\n");
		break;
		case CL_INVALID_KERNEL:
		fprintf(stderr, "Invalid kernel.\n");
		break;
		case CL_INVALID_ARG_INDEX:
		fprintf(stderr, "Invalid arg index.\n");
		break;
		case CL_INVALID_ARG_VALUE:
		fprintf(stderr, "Invalid arg value.\n");
		break;
		case CL_INVALID_ARG_SIZE:
		fprintf(stderr, "Invalid arg size.\n");
		break;
		case CL_INVALID_KERNEL_ARGS:
		fprintf(stderr, "Invalid kernel args.\n");
		break;
		case CL_INVALID_WORK_DIMENSION:
		fprintf(stderr, "Invalid work dimension.\n");
		break;
		case CL_INVALID_WORK_GROUP_SIZE:
		fprintf(stderr, "Invalid work group size.\n");
		break;
		case CL_INVALID_WORK_ITEM_SIZE:
		fprintf(stderr, "Invalid work item size.\n");
		break;
		case CL_INVALID_GLOBAL_OFFSET:
		fprintf(stderr, "Invalid global offset.\n");
		break;
		case CL_INVALID_EVENT_WAIT_LIST:
		fprintf(stderr, "Invalid event wait list.\n");
		break;
		case CL_INVALID_EVENT:
		fprintf(stderr, "Invalid event.\n");
		break;
		case CL_INVALID_OPERATION:
		fprintf(stderr, "Invalid operation.\n");
		break;
		case CL_INVALID_GL_OBJECT:
		fprintf(stderr, "Invalid GL object.\n");
		break;
		case CL_INVALID_BUFFER_SIZE:
		fprintf(stderr, "Invalid buffer size.\n");
		break;
		case CL_INVALID_MIP_LEVEL:
		fprintf(stderr, "Invalid mip level.\n");
		break;
		case CL_INVALID_GLOBAL_WORK_SIZE:
		fprintf(stderr, "Invalid global work size.\n");
		break;
		case CL_INVALID_PROPERTY:
		fprintf(stderr, "Invalid property.\n");
		break;
		case CL_INVALID_IMAGE_DESCRIPTOR:
		fprintf(stderr, "Invalid image descriptor.\n");
		break;
		case CL_INVALID_COMPILER_OPTIONS:
		fprintf(stderr, "Invalid compiler options.\n");
		break;
		case CL_INVALID_LINKER_OPTIONS:
		fprintf(stderr, "Invalid linker options.\n");
		break;
		case CL_INVALID_DEVICE_PARTITION_COUNT:
		fprintf(stderr, "Invalid device partition count.\n");
		break;
		case CL_INVALID_PIPE_SIZE:
		fprintf(stderr, "Invalid pipe size.\n");
		break;
		case CL_INVALID_DEVICE_QUEUE:
		fprintf(stderr, "Invalid device queue.\n");
		break;

		// NVIDIA specific error.
		case -9999:
		fprintf(stderr, "NVIDIA invalid read or write buffer.\n");
		break;

		default:
		fprintf(stderr, "Unknown error code!\n");
		break;
		}
		}
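
The naming convention described in Meinersbur's inline comment above (library-prefixed public functions, `static` for everything else) can be illustrated with a minimal sketch. The names `reportStatus` and `polly_sketchRun` are hypothetical and do not appear in GPUJIT.c; the sketch only assumes that a non-zero status means failure.

```c
#include <stdio.h>

/* File-local helper: `static` gives it internal linkage, so its name
 * cannot clash with a symbol of the same name in another library. */
static int reportStatus(int Status, const char *Msg) {
  if (Status != 0) {
    /* Diagnostics go to stderr, not stdout. */
    fprintf(stderr, "%s", Msg);
    return 0;
  }
  return 1;
}

/* Public entry point: carries the library's own prefix ("polly_"),
 * never the prefix of a foreign library such as "cl_".
 * Returns 1 on success and 0 on failure. */
int polly_sketchRun(int Status) {
  return reportStatus(Status, "Operation failed.\n");
}
```

Only `polly_sketchRun` is visible to the linker here; the helper stays private to its translation unit.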

		#endif /* HAS_LIBOPENCL */
		/******************************************************************************/
		/* CUDA */
		/******************************************************************************/
		#ifdef HAS_LIBCUDART

		struct CUDAContextT {
		CUcontext Cuda;
		};

		struct CUDAKernelT {
CUfunction Cuda;		CUfunction Cuda;
CUmodule CudaModule;		CUmodule CudaModule;
const char *PTXString;		const char *BinaryString;
};		};

struct PollyGPUDevicePtrT {		struct CUDADevicePtrT {
CUdeviceptr Cuda;		CUdeviceptr Cuda;
};		};

/* Dynamic library handles for the CUDA and CUDA runtime library. */		/* Dynamic library handles for the CUDA and CUDA runtime library. */
static void *HandleCuda;		static void *HandleCuda;
static void *HandleCudaRT;		static void *HandleCudaRT;

/* Type-defines of function pointer to CUDA driver APIs. */		/* Type-defines of function pointer to CUDA driver APIs. */
typedef CUresult CUDAAPI CuMemAllocFcnTy(CUdeviceptr *, size_t);		typedef CUresult CUDAAPI CuMemAllocFcnTy(CUdeviceptr *, size_t);
static CuMemAllocFcnTy *CuMemAllocFcnPtr;		static CuMemAllocFcnTy *CuMemAllocFcnPtr;

typedef CUresult CUDAAPI CuLaunchKernelFcnTy(		typedef CUresult CUDAAPI CuLaunchKernelFcnTy(
CUfunction f, unsigned int gridDimX, unsigned int gridDimY,		CUfunction F, unsigned int GridDimX, unsigned int GridDimY,
unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY,		unsigned int gridDimZ, unsigned int blockDimX, unsigned int BlockDimY,
unsigned int blockDimZ, unsigned int sharedMemBytes, CUstream hStream,		unsigned int BlockDimZ, unsigned int SharedMemBytes, CUstream HStream,
void **kernelParams, void **extra);		void **KernelParams, void **Extra);
static CuLaunchKernelFcnTy *CuLaunchKernelFcnPtr;		static CuLaunchKernelFcnTy *CuLaunchKernelFcnPtr;

typedef CUresult CUDAAPI CuMemcpyDtoHFcnTy(void *, CUdeviceptr, size_t);		typedef CUresult CUDAAPI CuMemcpyDtoHFcnTy(void *, CUdeviceptr, size_t);
static CuMemcpyDtoHFcnTy *CuMemcpyDtoHFcnPtr;		static CuMemcpyDtoHFcnTy *CuMemcpyDtoHFcnPtr;

typedef CUresult CUDAAPI CuMemcpyHtoDFcnTy(CUdeviceptr, const void *, size_t);		typedef CUresult CUDAAPI CuMemcpyHtoDFcnTy(CUdeviceptr, const void *, size_t);
static CuMemcpyHtoDFcnTy *CuMemcpyHtoDFcnPtr;		static CuMemcpyHtoDFcnTy *CuMemcpyHtoDFcnPtr;

[18 lines of context hidden in the original view]
typedef CUresult CUDAAPI CuDeviceGetFcnTy(CUdevice *, int);		typedef CUresult CUDAAPI CuDeviceGetFcnTy(CUdevice *, int);
static CuDeviceGetFcnTy *CuDeviceGetFcnPtr;		static CuDeviceGetFcnTy *CuDeviceGetFcnPtr;

typedef CUresult CUDAAPI CuModuleLoadDataExFcnTy(CUmodule *, const void *,		typedef CUresult CUDAAPI CuModuleLoadDataExFcnTy(CUmodule *, const void *,
unsigned int, CUjit_option *,		unsigned int, CUjit_option *,
void **);		void **);
static CuModuleLoadDataExFcnTy *CuModuleLoadDataExFcnPtr;		static CuModuleLoadDataExFcnTy *CuModuleLoadDataExFcnPtr;

typedef CUresult CUDAAPI CuModuleLoadDataFcnTy(CUmodule *module,		typedef CUresult CUDAAPI CuModuleLoadDataFcnTy(CUmodule *Module,
const void *image);		const void *Image);
static CuModuleLoadDataFcnTy *CuModuleLoadDataFcnPtr;		static CuModuleLoadDataFcnTy *CuModuleLoadDataFcnPtr;

typedef CUresult CUDAAPI CuModuleGetFunctionFcnTy(CUfunction *, CUmodule,		typedef CUresult CUDAAPI CuModuleGetFunctionFcnTy(CUfunction *, CUmodule,
const char *);		const char *);
static CuModuleGetFunctionFcnTy *CuModuleGetFunctionFcnPtr;		static CuModuleGetFunctionFcnTy *CuModuleGetFunctionFcnPtr;

typedef CUresult CUDAAPI CuDeviceComputeCapabilityFcnTy(int *, int *, CUdevice);		typedef CUresult CUDAAPI CuDeviceComputeCapabilityFcnTy(int *, int *, CUdevice);
static CuDeviceComputeCapabilityFcnTy *CuDeviceComputeCapabilityFcnPtr;		static CuDeviceComputeCapabilityFcnTy *CuDeviceComputeCapabilityFcnPtr;

typedef CUresult CUDAAPI CuDeviceGetNameFcnTy(char *, int, CUdevice);		typedef CUresult CUDAAPI CuDeviceGetNameFcnTy(char *, int, CUdevice);
static CuDeviceGetNameFcnTy *CuDeviceGetNameFcnPtr;		static CuDeviceGetNameFcnTy *CuDeviceGetNameFcnPtr;

typedef CUresult CUDAAPI CuLinkAddDataFcnTy(CUlinkState state,		typedef CUresult CUDAAPI CuLinkAddDataFcnTy(CUlinkState State,
CUjitInputType type, void *data,		CUjitInputType Type, void *Data,
size_t size, const char *name,		size_t Size, const char *Name,
unsigned int numOptions,		unsigned int NumOptions,
CUjit_option *options,		CUjit_option *Options,
void **optionValues);		void **OptionValues);
static CuLinkAddDataFcnTy *CuLinkAddDataFcnPtr;		static CuLinkAddDataFcnTy *CuLinkAddDataFcnPtr;
		Meinersbur (inline comment, done): These are unrelated changes Tobias usually complains about. I personally don't care.

typedef CUresult CUDAAPI CuLinkCreateFcnTy(unsigned int numOptions,		typedef CUresult CUDAAPI CuLinkCreateFcnTy(unsigned int NumOptions,
CUjit_option *options,		CUjit_option *Options,
void **optionValues,		void **OptionValues,
CUlinkState *stateOut);		CUlinkState *StateOut);
static CuLinkCreateFcnTy *CuLinkCreateFcnPtr;		static CuLinkCreateFcnTy *CuLinkCreateFcnPtr;

typedef CUresult CUDAAPI CuLinkCompleteFcnTy(CUlinkState state, void **cubinOut,		typedef CUresult CUDAAPI CuLinkCompleteFcnTy(CUlinkState State, void **CubinOut,
size_t *sizeOut);		size_t *SizeOut);
static CuLinkCompleteFcnTy *CuLinkCompleteFcnPtr;		static CuLinkCompleteFcnTy *CuLinkCompleteFcnPtr;

typedef CUresult CUDAAPI CuLinkDestroyFcnTy(CUlinkState state);		typedef CUresult CUDAAPI CuLinkDestroyFcnTy(CUlinkState State);
static CuLinkDestroyFcnTy *CuLinkDestroyFcnPtr;		static CuLinkDestroyFcnTy *CuLinkDestroyFcnPtr;

typedef CUresult CUDAAPI CuCtxSynchronizeFcnTy();		typedef CUresult CUDAAPI CuCtxSynchronizeFcnTy();
static CuCtxSynchronizeFcnTy *CuCtxSynchronizeFcnPtr;		static CuCtxSynchronizeFcnTy *CuCtxSynchronizeFcnPtr;

/* Type-defines of function pointer ot CUDA runtime APIs. */		/* Type-defines of function pointer ot CUDA runtime APIs. */
typedef cudaError_t CUDARTAPI CudaThreadSynchronizeFcnTy(void);		typedef cudaError_t CUDARTAPI CudaThreadSynchronizeFcnTy(void);
static CudaThreadSynchronizeFcnTy *CudaThreadSynchronizeFcnPtr;		static CudaThreadSynchronizeFcnTy *CudaThreadSynchronizeFcnPtr;

static void *getAPIHandle(void *Handle, const char *FuncName) {		static void *getAPIHandleCUDA(void *Handle, const char *FuncName) {
char *Err;		char *Err;
void *FuncPtr;		void *FuncPtr;
dlerror();		dlerror();
FuncPtr = dlsym(Handle, FuncName);		FuncPtr = dlsym(Handle, FuncName);
if ((Err = dlerror()) != 0) {		if ((Err = dlerror()) != 0) {
fprintf(stdout, "Load CUDA driver API failed: %s. \n", Err);		fprintf(stderr, "Load CUDA driver API failed: %s. \n", Err);
return 0;		return 0;
}		}
return FuncPtr;		return FuncPtr;
}		}
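
The `dlerror()` / `dlsym()` / `dlerror()` sequence used by `getAPIHandleCUDA` above can be tried in isolation. The following is a minimal sketch, assuming a dynamically linked POSIX system on which `strlen` is reachable through the main program's handle (older glibc versions additionally need `-ldl` at link time). `lookupSymbol`, `callStrlenDynamically`, and `StrlenFcnTy` are hypothetical names chosen for illustration.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Function type of the symbol we want to call through a pointer. */
typedef size_t StrlenFcnTy(const char *);

/* Look up Name in Handle. On failure, print the loader's message to
 * stderr and return NULL. */
static void *lookupSymbol(void *Handle, const char *Name) {
  dlerror(); /* Clear any stale error state first. */
  void *Ptr = dlsym(Handle, Name);
  const char *Err = dlerror();
  if (Err != NULL) {
    fprintf(stderr, "Symbol lookup failed: %s\n", Err);
    return NULL;
  }
  return Ptr;
}

/* Resolve strlen at run time and call it. Returns (size_t)-1 if the
 * handle or the symbol could not be obtained. */
static size_t callStrlenDynamically(const char *Text) {
  void *Self = dlopen(NULL, RTLD_LAZY); /* Handle for the main program. */
  if (!Self)
    return (size_t)-1;
  /* ISO C does not define this void* to function-pointer cast;
   * POSIX requires it to work for values returned by dlsym(). */
  StrlenFcnTy *Fn = (StrlenFcnTy *)lookupSymbol(Self, "strlen");
  size_t Len = Fn ? Fn(Text) : (size_t)-1;
  dlclose(Self);
  return Len;
}
```

Checking `dlerror()` rather than the returned pointer is what lets the caller distinguish a failed lookup from a symbol whose value happens to be null.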

static int initialDeviceAPILibraries() {		static int initialDeviceAPILibrariesCUDA() {
HandleCuda = dlopen("libcuda.so", RTLD_LAZY);		HandleCuda = dlopen("libcuda.so", RTLD_LAZY);
if (!HandleCuda) {		if (!HandleCuda) {
printf("Cannot open library: %s. \n", dlerror());		fprintf(stderr, "Cannot open library: %s. \n", dlerror());
return 0;		return 0;
}		}

HandleCudaRT = dlopen("libcudart.so", RTLD_LAZY);		HandleCudaRT = dlopen("libcudart.so", RTLD_LAZY);
if (!HandleCudaRT) {		if (!HandleCudaRT) {
printf("Cannot open library: %s. \n", dlerror());		fprintf(stderr, "Cannot open library: %s. \n", dlerror());
return 0;		return 0;
}		}

return 1;		return 1;
}		}

static int initialDeviceAPIs() {		static int initialDeviceAPIsCUDA() {
if (initialDeviceAPILibraries() == 0)		if (initialDeviceAPILibrariesCUDA() == 0)
return 0;		return 0;

/* Get function pointer to CUDA Driver APIs.		/* Get function pointer to CUDA Driver APIs.
*		*
* Note that compilers conforming to the ISO C standard are required to		* Note that compilers conforming to the ISO C standard are required to
* generate a warning if a conversion from a void * pointer to a function		* generate a warning if a conversion from a void * pointer to a function
* pointer is attempted as in the following statements. The warning		* pointer is attempted as in the following statements. The warning
* of this kind of cast may not be emitted by clang and new versions of gcc		* of this kind of cast may not be emitted by clang and new versions of gcc
* as it is valid on POSIX 2008.		* as it is valid on POSIX 2008.
*/		*/
CuLaunchKernelFcnPtr =		CuLaunchKernelFcnPtr =
(CuLaunchKernelFcnTy *)getAPIHandle(HandleCuda, "cuLaunchKernel");		(CuLaunchKernelFcnTy *)getAPIHandleCUDA(HandleCuda, "cuLaunchKernel");

CuMemAllocFcnPtr =		CuMemAllocFcnPtr =
(CuMemAllocFcnTy *)getAPIHandle(HandleCuda, "cuMemAlloc_v2");		(CuMemAllocFcnTy *)getAPIHandleCUDA(HandleCuda, "cuMemAlloc_v2");

CuMemFreeFcnPtr = (CuMemFreeFcnTy *)getAPIHandle(HandleCuda, "cuMemFree_v2");		CuMemFreeFcnPtr =
		(CuMemFreeFcnTy *)getAPIHandleCUDA(HandleCuda, "cuMemFree_v2");

CuMemcpyDtoHFcnPtr =		CuMemcpyDtoHFcnPtr =
(CuMemcpyDtoHFcnTy *)getAPIHandle(HandleCuda, "cuMemcpyDtoH_v2");		(CuMemcpyDtoHFcnTy *)getAPIHandleCUDA(HandleCuda, "cuMemcpyDtoH_v2");

CuMemcpyHtoDFcnPtr =		CuMemcpyHtoDFcnPtr =
(CuMemcpyHtoDFcnTy *)getAPIHandle(HandleCuda, "cuMemcpyHtoD_v2");		(CuMemcpyHtoDFcnTy *)getAPIHandleCUDA(HandleCuda, "cuMemcpyHtoD_v2");

CuModuleUnloadFcnPtr =		CuModuleUnloadFcnPtr =
(CuModuleUnloadFcnTy *)getAPIHandle(HandleCuda, "cuModuleUnload");		(CuModuleUnloadFcnTy *)getAPIHandleCUDA(HandleCuda, "cuModuleUnload");

CuCtxDestroyFcnPtr =		CuCtxDestroyFcnPtr =
(CuCtxDestroyFcnTy *)getAPIHandle(HandleCuda, "cuCtxDestroy");		(CuCtxDestroyFcnTy *)getAPIHandleCUDA(HandleCuda, "cuCtxDestroy");

CuInitFcnPtr = (CuInitFcnTy *)getAPIHandle(HandleCuda, "cuInit");		CuInitFcnPtr = (CuInitFcnTy *)getAPIHandleCUDA(HandleCuda, "cuInit");

CuDeviceGetCountFcnPtr =		CuDeviceGetCountFcnPtr =
(CuDeviceGetCountFcnTy *)getAPIHandle(HandleCuda, "cuDeviceGetCount");		(CuDeviceGetCountFcnTy *)getAPIHandleCUDA(HandleCuda, "cuDeviceGetCount");

CuDeviceGetFcnPtr =		CuDeviceGetFcnPtr =
(CuDeviceGetFcnTy *)getAPIHandle(HandleCuda, "cuDeviceGet");		(CuDeviceGetFcnTy *)getAPIHandleCUDA(HandleCuda, "cuDeviceGet");

CuCtxCreateFcnPtr =		CuCtxCreateFcnPtr =
(CuCtxCreateFcnTy *)getAPIHandle(HandleCuda, "cuCtxCreate_v2");		(CuCtxCreateFcnTy *)getAPIHandleCUDA(HandleCuda, "cuCtxCreate_v2");

CuModuleLoadDataExFcnPtr =		CuModuleLoadDataExFcnPtr = (CuModuleLoadDataExFcnTy *)getAPIHandleCUDA(
(CuModuleLoadDataExFcnTy *)getAPIHandle(HandleCuda, "cuModuleLoadDataEx");		HandleCuda, "cuModuleLoadDataEx");

CuModuleLoadDataFcnPtr =		CuModuleLoadDataFcnPtr =
(CuModuleLoadDataFcnTy *)getAPIHandle(HandleCuda, "cuModuleLoadData");		(CuModuleLoadDataFcnTy *)getAPIHandleCUDA(HandleCuda, "cuModuleLoadData");

CuModuleGetFunctionFcnPtr = (CuModuleGetFunctionFcnTy *)getAPIHandle(		CuModuleGetFunctionFcnPtr = (CuModuleGetFunctionFcnTy *)getAPIHandleCUDA(
HandleCuda, "cuModuleGetFunction");		HandleCuda, "cuModuleGetFunction");

CuDeviceComputeCapabilityFcnPtr =		CuDeviceComputeCapabilityFcnPtr =
(CuDeviceComputeCapabilityFcnTy *)getAPIHandle(		(CuDeviceComputeCapabilityFcnTy *)getAPIHandleCUDA(
HandleCuda, "cuDeviceComputeCapability");		HandleCuda, "cuDeviceComputeCapability");

CuDeviceGetNameFcnPtr =		CuDeviceGetNameFcnPtr =
(CuDeviceGetNameFcnTy *)getAPIHandle(HandleCuda, "cuDeviceGetName");		(CuDeviceGetNameFcnTy *)getAPIHandleCUDA(HandleCuda, "cuDeviceGetName");

CuLinkAddDataFcnPtr =		CuLinkAddDataFcnPtr =
(CuLinkAddDataFcnTy *)getAPIHandle(HandleCuda, "cuLinkAddData");		(CuLinkAddDataFcnTy *)getAPIHandleCUDA(HandleCuda, "cuLinkAddData");

CuLinkCreateFcnPtr =		CuLinkCreateFcnPtr =
(CuLinkCreateFcnTy *)getAPIHandle(HandleCuda, "cuLinkCreate");		(CuLinkCreateFcnTy *)getAPIHandleCUDA(HandleCuda, "cuLinkCreate");

CuLinkCompleteFcnPtr =		CuLinkCompleteFcnPtr =
(CuLinkCompleteFcnTy *)getAPIHandle(HandleCuda, "cuLinkComplete");		(CuLinkCompleteFcnTy *)getAPIHandleCUDA(HandleCuda, "cuLinkComplete");

CuLinkDestroyFcnPtr =		CuLinkDestroyFcnPtr =
(CuLinkDestroyFcnTy *)getAPIHandle(HandleCuda, "cuLinkDestroy");		(CuLinkDestroyFcnTy *)getAPIHandleCUDA(HandleCuda, "cuLinkDestroy");

CuCtxSynchronizeFcnPtr =		CuCtxSynchronizeFcnPtr =
(CuCtxSynchronizeFcnTy *)getAPIHandle(HandleCuda, "cuCtxSynchronize");		(CuCtxSynchronizeFcnTy *)getAPIHandleCUDA(HandleCuda, "cuCtxSynchronize");

/* Get function pointer to CUDA Runtime APIs. */		/* Get function pointer to CUDA Runtime APIs. */
CudaThreadSynchronizeFcnPtr = (CudaThreadSynchronizeFcnTy *)getAPIHandle(		CudaThreadSynchronizeFcnPtr = (CudaThreadSynchronizeFcnTy *)getAPIHandleCUDA(
HandleCudaRT, "cudaThreadSynchronize");		HandleCudaRT, "cudaThreadSynchronize");

return 1;		return 1;
}		}

PollyGPUContext *polly_initContext() {		static PollyGPUContext *initContextCUDA() {
DebugMode = getenv("POLLY_DEBUG") != 0;

dump_function();		dump_function();
PollyGPUContext *Context;		PollyGPUContext *Context;
		Meinersbur (inline comment, done): Unrelated whitespace change?
CUdevice Device;		CUdevice Device;

int Major = 0, Minor = 0, DeviceID = 0;		int Major = 0, Minor = 0, DeviceID = 0;
char DeviceName[256];		char DeviceName[256];
int DeviceCount = 0;		int DeviceCount = 0;

static __thread PollyGPUContext *CurrentContext = NULL;		static __thread PollyGPUContext *CurrentContext = NULL;

if (CurrentContext)		if (CurrentContext)
return CurrentContext;		return CurrentContext;

/* Get API handles. */		/* Get API handles. */
if (initialDeviceAPIs() == 0) {		if (initialDeviceAPIsCUDA() == 0) {
fprintf(stdout, "Getting the \"handle\" for the CUDA driver API failed.\n");		fprintf(stderr, "Getting the \"handle\" for the CUDA driver API failed.\n");
exit(-1);		exit(-1);
}		}

if (CuInitFcnPtr(0) != CUDA_SUCCESS) {		if (CuInitFcnPtr(0) != CUDA_SUCCESS) {
fprintf(stdout, "Initializing the CUDA driver API failed.\n");		fprintf(stderr, "Initializing the CUDA driver API failed.\n");
exit(-1);		exit(-1);
}		}

/* Get number of devices that supports CUDA. */		/* Get number of devices that supports CUDA. */
CuDeviceGetCountFcnPtr(&DeviceCount);		CuDeviceGetCountFcnPtr(&DeviceCount);
if (DeviceCount == 0) {		if (DeviceCount == 0) {
fprintf(stdout, "There is no device supporting CUDA.\n");		fprintf(stderr, "There is no device supporting CUDA.\n");
exit(-1);		exit(-1);
}		}

CuDeviceGetFcnPtr(&Device, 0);		CuDeviceGetFcnPtr(&Device, 0);

/* Get compute capabilities and the device name. */		/* Get compute capabilities and the device name. */
CuDeviceComputeCapabilityFcnPtr(&Major, &Minor, Device);		CuDeviceComputeCapabilityFcnPtr(&Major, &Minor, Device);
CuDeviceGetNameFcnPtr(DeviceName, 256, Device);		CuDeviceGetNameFcnPtr(DeviceName, 256, Device);
debug_print("> Running on GPU device %d : %s.\n", DeviceID, DeviceName);		debug_print("> Running on GPU device %d : %s.\n", DeviceID, DeviceName);

/* Create context on the device. */		/* Create context on the device. */
Context = (PollyGPUContext *)malloc(sizeof(PollyGPUContext));		Context = (PollyGPUContext *)malloc(sizeof(PollyGPUContext));
if (Context == 0) {		if (Context == 0) {
fprintf(stdout, "Allocate memory for Polly GPU context failed.\n");		fprintf(stderr, "Allocate memory for Polly GPU context failed.\n");
exit(-1);		exit(-1);
}		}
CuCtxCreateFcnPtr(&(Context->Cuda), 0, Device);		Context->Context = malloc(sizeof(CUDAContext));
		if (Context->Context == 0) {
CacheMode = getenv("POLLY_NOCACHE") == 0;		fprintf(stderr, "Allocate memory for Polly CUDA context failed.\n");
		exit(-1);
		}
		CuCtxCreateFcnPtr(&(((CUDAContext *)Context->Context)->Cuda), 0, Device);

if (CacheMode)		if (CacheMode)
CurrentContext = Context;		CurrentContext = Context;

return Context;		return Context;
}		}

static void freeKernel(PollyGPUFunction *Kernel) {		static void freeKernelCUDA(PollyGPUFunction *Kernel) {
if (Kernel->CudaModule)		dump_function();
CuModuleUnloadFcnPtr(Kernel->CudaModule);
		if (CacheMode)
		return;

		if (((CUDAKernel *)Kernel->Kernel)->CudaModule)
		CuModuleUnloadFcnPtr(((CUDAKernel *)Kernel->Kernel)->CudaModule);

		if (Kernel->Kernel)
		free((CUDAKernel *)Kernel->Kernel);

if (Kernel)		if (Kernel)
free(Kernel);		free(Kernel);
}		}

#define KERNEL_CACHE_SIZE 10		static PollyGPUFunction *getKernelCUDA(const char *BinaryBuffer,

PollyGPUFunction *polly_getKernel(const char *PTXBuffer,
const char *KernelName) {		const char *KernelName) {
dump_function();		dump_function();

static __thread PollyGPUFunction *KernelCache[KERNEL_CACHE_SIZE];		static __thread PollyGPUFunction *KernelCache[KERNEL_CACHE_SIZE];
static __thread int NextCacheItem = 0;		static __thread int NextCacheItem = 0;

for (long i = 0; i < KERNEL_CACHE_SIZE; i++) {		for (long i = 0; i < KERNEL_CACHE_SIZE; i++) {
// We exploit here the property that all Polly-ACC kernels are allocated		// We exploit here the property that all Polly-ACC kernels are allocated
// as global constants, hence a pointer comparision is sufficient to		// as global constants, hence a pointer comparision is sufficient to
// determin equality.		// determin equality.
if (KernelCache[i] && KernelCache[i]->PTXString == PTXBuffer) {		if (KernelCache[i] &&
		((CUDAKernel *)KernelCache[i]->Kernel)->BinaryString == BinaryBuffer) {
debug_print(" -> using cached kernel\n");		debug_print(" -> using cached kernel\n");
return KernelCache[i];		return KernelCache[i];
}		}
}		}

PollyGPUFunction *Function = malloc(sizeof(PollyGPUFunction));		PollyGPUFunction *Function = malloc(sizeof(PollyGPUFunction));

if (Function == 0) {		if (Function == 0) {
fprintf(stdout, "Allocate memory for Polly GPU function failed.\n");		fprintf(stderr, "Allocate memory for Polly GPU function failed.\n");
		exit(-1);
		}
		Function->Kernel = (CUDAKernel *)malloc(sizeof(CUDAKernel));
		if (Function->Kernel == 0) {
		fprintf(stderr, "Allocate memory for Polly CUDA function failed.\n");
exit(-1);		exit(-1);
}		}

CUresult Res;		CUresult Res;
CUlinkState LState;		CUlinkState LState;
CUjit_option Options[6];		CUjit_option Options[6];
void *OptionVals[6];		void *OptionVals[6];
float Walltime = 0;		float Walltime = 0;
[20 lines of context hidden in the original view]
OptionVals[4] = (void *)LogSize;		OptionVals[4] = (void *)LogSize;
// Make the linker verbose		// Make the linker verbose
Options[5] = CU_JIT_LOG_VERBOSE;		Options[5] = CU_JIT_LOG_VERBOSE;
OptionVals[5] = (void *)1;		OptionVals[5] = (void *)1;

memset(ErrorLog, 0, sizeof(ErrorLog));		memset(ErrorLog, 0, sizeof(ErrorLog));

CuLinkCreateFcnPtr(6, Options, OptionVals, &LState);		CuLinkCreateFcnPtr(6, Options, OptionVals, &LState);
Res = CuLinkAddDataFcnPtr(LState, CU_JIT_INPUT_PTX, (void *)PTXBuffer,		Res = CuLinkAddDataFcnPtr(LState, CU_JIT_INPUT_PTX, (void *)BinaryBuffer,
strlen(PTXBuffer) + 1, 0, 0, 0, 0);		strlen(BinaryBuffer) + 1, 0, 0, 0, 0);
if (Res != CUDA_SUCCESS) {		if (Res != CUDA_SUCCESS) {
fprintf(stdout, "PTX Linker Error:\n%s\n%s", ErrorLog, InfoLog);		fprintf(stderr, "PTX Linker Error:\n%s\n%s", ErrorLog, InfoLog);
exit(-1);		exit(-1);
}		}

Res = CuLinkCompleteFcnPtr(LState, &CuOut, &OutSize);		Res = CuLinkCompleteFcnPtr(LState, &CuOut, &OutSize);
if (Res != CUDA_SUCCESS) {		if (Res != CUDA_SUCCESS) {
fprintf(stdout, "Complete ptx linker step failed.\n");		fprintf(stderr, "Complete ptx linker step failed.\n");
fprintf(stdout, "\n%s\n", ErrorLog);		fprintf(stderr, "\n%s\n", ErrorLog);
exit(-1);		exit(-1);
}		}

debug_print("CUDA Link Completed in %fms. Linker Output:\n%s\n", Walltime,		debug_print("CUDA Link Completed in %fms. Linker Output:\n%s\n", Walltime,
InfoLog);		InfoLog);

Res = CuModuleLoadDataFcnPtr(&(Function->CudaModule), CuOut);		Res = CuModuleLoadDataFcnPtr(&(((CUDAKernel *)Function->Kernel)->CudaModule),
		CuOut);
if (Res != CUDA_SUCCESS) {		if (Res != CUDA_SUCCESS) {
fprintf(stdout, "Loading ptx assembly text failed.\n");		fprintf(stderr, "Loading ptx assembly text failed.\n");
exit(-1);		exit(-1);
}		}

Res = CuModuleGetFunctionFcnPtr(&(Function->Cuda), Function->CudaModule,		Res = CuModuleGetFunctionFcnPtr(&(((CUDAKernel *)Function->Kernel)->Cuda),
		((CUDAKernel *)Function->Kernel)->CudaModule,
KernelName);		KernelName);
if (Res != CUDA_SUCCESS) {		if (Res != CUDA_SUCCESS) {
fprintf(stdout, "Loading kernel function failed.\n");		fprintf(stderr, "Loading kernel function failed.\n");
exit(-1);		exit(-1);
}		}

CuLinkDestroyFcnPtr(LState);		CuLinkDestroyFcnPtr(LState);

Function->PTXString = PTXBuffer;		((CUDAKernel *)Function->Kernel)->BinaryString = BinaryBuffer;

if (CacheMode) {		if (CacheMode) {
if (KernelCache[NextCacheItem])		if (KernelCache[NextCacheItem])
freeKernel(KernelCache[NextCacheItem]);		freeKernelCUDA(KernelCache[NextCacheItem]);

KernelCache[NextCacheItem] = Function;		KernelCache[NextCacheItem] = Function;

NextCacheItem = (NextCacheItem + 1) % KERNEL_CACHE_SIZE;		NextCacheItem = (NextCacheItem + 1) % KERNEL_CACHE_SIZE;
}		}

return Function;		return Function;
}		}
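
The kernel cache in `getKernelCUDA` above combines two ideas: lookup by pointer identity (the kernel strings are global constants, so equal addresses imply equal contents) and round-robin replacement in a fixed-size array. A minimal standalone sketch of that scheme follows; the `Sketch*` names and the `int` payload are hypothetical simplifications, and the sketch omits the thread-local storage and kernel freeing that the real code performs.

```c
#include <stddef.h>

#define SKETCH_CACHE_SIZE 4

typedef struct {
  const char *Key; /* Compared by address, never by contents. */
  int Value;       /* Stand-in for the cached kernel object. */
} SketchEntry;

static SketchEntry SketchCache[SKETCH_CACHE_SIZE];
static int SketchNext = 0; /* Slot that the next insert overwrites. */

/* Return the entry whose key is the very same pointer, or NULL. */
static SketchEntry *sketchLookup(const char *Key) {
  if (Key == NULL)
    return NULL; /* Empty slots hold NULL keys; never match them. */
  for (int I = 0; I < SKETCH_CACHE_SIZE; I++)
    if (SketchCache[I].Key == Key)
      return &SketchCache[I];
  return NULL;
}

/* Store an entry, evicting the oldest slot in round-robin order. */
static SketchEntry *sketchInsert(const char *Key, int Value) {
  SketchEntry *Slot = &SketchCache[SketchNext];
  Slot->Key = Key;
  Slot->Value = Value;
  SketchNext = (SketchNext + 1) % SKETCH_CACHE_SIZE;
  return Slot;
}
```

Two distinct buffers with identical text are treated as different keys, which is only sound under the stated assumption that each kernel string lives at one fixed address.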

void polly_freeKernel(PollyGPUFunction *Kernel) {		static void synchronizeDeviceCUDA() {
dump_function();		dump_function();
		if (CuCtxSynchronizeFcnPtr() != CUDA_SUCCESS) {
if (CacheMode)		fprintf(stderr, "Synchronizing device and host memory failed.\n");
return;		exit(-1);
		}
freeKernel(Kernel);
}		}

void polly_copyFromHostToDevice(void *HostData, PollyGPUDevicePtr *DevData,		static void copyFromHostToDeviceCUDA(void *HostData, PollyGPUDevicePtr *DevData,
long MemSize) {		long MemSize) {
dump_function();		dump_function();

CUdeviceptr CuDevData = DevData->Cuda;		CUdeviceptr CuDevData = ((CUDADevicePtr *)DevData->DevicePtr)->Cuda;
CuMemcpyHtoDFcnPtr(CuDevData, HostData, MemSize);		CuMemcpyHtoDFcnPtr(CuDevData, HostData, MemSize);
}		}

void polly_copyFromDeviceToHost(PollyGPUDevicePtr *DevData, void *HostData,		static void copyFromDeviceToHostCUDA(PollyGPUDevicePtr *DevData, void *HostData,
long MemSize) {		long MemSize) {
dump_function();		dump_function();

if (CuMemcpyDtoHFcnPtr(HostData, DevData->Cuda, MemSize) != CUDA_SUCCESS) {		if (CuMemcpyDtoHFcnPtr(HostData, ((CUDADevicePtr *)DevData->DevicePtr)->Cuda,
fprintf(stdout, "Copying results from device to host memory failed.\n");		MemSize) != CUDA_SUCCESS) {
exit(-1);		fprintf(stderr, "Copying results from device to host memory failed.\n");
}
}
void polly_synchronizeDevice() {
dump_function();
if (CuCtxSynchronizeFcnPtr() != CUDA_SUCCESS) {
fprintf(stdout, "Synchronizing device and host memory failed.\n");
exit(-1);		exit(-1);
}		}
}		}

void polly_launchKernel(PollyGPUFunction *Kernel, unsigned int GridDimX,		static void launchKernelCUDA(PollyGPUFunction *Kernel, unsigned int GridDimX,
unsigned int GridDimY, unsigned int BlockDimX,		unsigned int GridDimY, unsigned int BlockDimX,
unsigned int BlockDimY, unsigned int BlockDimZ,		unsigned int BlockDimY, unsigned int BlockDimZ,
void **Parameters) {		void **Parameters) {
dump_function();		dump_function();

unsigned GridDimZ = 1;		unsigned GridDimZ = 1;
unsigned int SharedMemBytes = CU_SHARED_MEM_CONFIG_DEFAULT_BANK_SIZE;		unsigned int SharedMemBytes = CU_SHARED_MEM_CONFIG_DEFAULT_BANK_SIZE;
CUstream Stream = 0;		CUstream Stream = 0;
void **Extra = 0;		void **Extra = 0;

CUresult Res;		CUresult Res;
Res = CuLaunchKernelFcnPtr(Kernel->Cuda, GridDimX, GridDimY, GridDimZ,		Res =
BlockDimX, BlockDimY, BlockDimZ, SharedMemBytes,		CuLaunchKernelFcnPtr(((CUDAKernel *)Kernel->Kernel)->Cuda, GridDimX,
Stream, Parameters, Extra);		GridDimY, GridDimZ, BlockDimX, BlockDimY, BlockDimZ,
		SharedMemBytes, Stream, Parameters, Extra);
if (Res != CUDA_SUCCESS) {		if (Res != CUDA_SUCCESS) {
fprintf(stdout, "Launching CUDA kernel failed.\n");		fprintf(stderr, "Launching CUDA kernel failed.\n");
exit(-1);		exit(-1);
}		}
}		}

void polly_freeDeviceMemory(PollyGPUDevicePtr *Allocation) {		static void freeDeviceMemoryCUDA(PollyGPUDevicePtr *Allocation) {
dump_function();		dump_function();
CuMemFreeFcnPtr((CUdeviceptr)Allocation->Cuda);		CUDADevicePtr *DevPtr = (CUDADevicePtr *)Allocation->DevicePtr;
		CuMemFreeFcnPtr((CUdeviceptr)DevPtr->Cuda);
		free(DevPtr);
free(Allocation);		free(Allocation);
}		}

PollyGPUDevicePtr *polly_allocateMemoryForDevice(long MemSize) {		static PollyGPUDevicePtr *allocateMemoryForDeviceCUDA(long MemSize) {
dump_function();		dump_function();

PollyGPUDevicePtr *DevData = malloc(sizeof(PollyGPUDevicePtr));		PollyGPUDevicePtr *DevData = malloc(sizeof(PollyGPUDevicePtr));

if (DevData == 0) {		if (DevData == 0) {
fprintf(stdout, "Allocate memory for GPU device memory pointer failed.\n");		fprintf(stderr, "Allocate memory for GPU device memory pointer failed.\n");
		exit(-1);
		}
		DevData->DevicePtr = (CUDADevicePtr *)malloc(sizeof(CUDADevicePtr));
		if (DevData->DevicePtr == 0) {
		fprintf(stderr, "Allocate memory for GPU device memory pointer failed.\n");
exit(-1);		exit(-1);
}		}

CUresult Res = CuMemAllocFcnPtr(&(DevData->Cuda), MemSize);		CUresult Res =
		CuMemAllocFcnPtr(&(((CUDADevicePtr *)DevData->DevicePtr)->Cuda), MemSize);

if (Res != CUDA_SUCCESS) {		if (Res != CUDA_SUCCESS) {
fprintf(stdout, "Allocate memory for GPU device memory pointer failed.\n");		fprintf(stderr, "Allocate memory for GPU device memory pointer failed.\n");
exit(-1);		exit(-1);
}		}

return DevData;		return DevData;
}		}

		static void *getDevicePtrCUDA(PollyGPUDevicePtr *Allocation) {
		dump_function();

		CUDADevicePtr *DevPtr = (CUDADevicePtr *)Allocation->DevicePtr;
		return (void *)DevPtr->Cuda;
		}

		static void freeContextCUDA(PollyGPUContext *Context) {
		dump_function();

		CUDAContext *Ctx = (CUDAContext *)Context->Context;
		if (Ctx->Cuda) {
		CuCtxDestroyFcnPtr(Ctx->Cuda);
		free(Ctx);
		free(Context);
		}

		dlclose(HandleCuda);
		dlclose(HandleCudaRT);
		}

		#endif /* HAS_LIBCUDART */
		/******************************************************************************/
		/* API */
		/******************************************************************************/

		PollyGPUContext *polly_initContext() {
		DebugMode = getenv("POLLY_DEBUG") != 0;
		CacheMode = getenv("POLLY_NOCACHE") == 0;

		dump_function();

		PollyGPUContext *Context;

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		Context = initContextCUDA();
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		Context = initContextCL();
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}

		return Context;
		}

		void polly_freeKernel(PollyGPUFunction *Kernel) {
		dump_function();

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		freeKernelCUDA(Kernel);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		freeKernelCL(Kernel);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}
		}

		PollyGPUFunction *polly_getKernel(const char *BinaryBuffer,
		const char *KernelName) {
		dump_function();

		PollyGPUFunction *Function;

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		Function = getKernelCUDA(BinaryBuffer, KernelName);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		Function = getKernelCL(BinaryBuffer, KernelName);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}

		return Function;
		}

		void polly_copyFromHostToDevice(void *HostData, PollyGPUDevicePtr *DevData,
		long MemSize) {
		dump_function();

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		copyFromHostToDeviceCUDA(HostData, DevData, MemSize);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		copyFromHostToDeviceCL(HostData, DevData, MemSize);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}
		}

		void polly_copyFromDeviceToHost(PollyGPUDevicePtr *DevData, void *HostData,
		long MemSize) {
		dump_function();

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		copyFromDeviceToHostCUDA(DevData, HostData, MemSize);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		copyFromDeviceToHostCL(DevData, HostData, MemSize);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}
		}

		void polly_launchKernel(PollyGPUFunction *Kernel, unsigned int GridDimX,
		unsigned int GridDimY, unsigned int BlockDimX,
		unsigned int BlockDimY, unsigned int BlockDimZ,
		void **Parameters) {
		dump_function();

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		launchKernelCUDA(Kernel, GridDimX, GridDimY, BlockDimX, BlockDimY,
		BlockDimZ, Parameters);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		launchKernelCL(Kernel, GridDimX, GridDimY, BlockDimX, BlockDimY, BlockDimZ,
		Parameters);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}
		}

		void polly_freeDeviceMemory(PollyGPUDevicePtr *Allocation) {
		dump_function();

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		freeDeviceMemoryCUDA(Allocation);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		freeDeviceMemoryCL(Allocation);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}
		}

		PollyGPUDevicePtr *polly_allocateMemoryForDevice(long MemSize) {
		dump_function();

		PollyGPUDevicePtr *DevData;

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		DevData = allocateMemoryForDeviceCUDA(MemSize);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		DevData = allocateMemoryForDeviceCL(MemSize);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}

		return DevData;
		}

void *polly_getDevicePtr(PollyGPUDevicePtr *Allocation) {		void *polly_getDevicePtr(PollyGPUDevicePtr *Allocation) {
dump_function();		dump_function();

return (void *)Allocation->Cuda;		void *DevPtr;

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		DevPtr = getDevicePtrCUDA(Allocation);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		DevPtr = getDevicePtrCL(Allocation);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}

		return DevPtr;
		}

		void polly_synchronizeDevice() {
		dump_function();

		switch (Runtime) {
		#ifdef HAS_LIBCUDART
		case RUNTIME_CUDA:
		synchronizeDeviceCUDA();
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		synchronizeDeviceCL();
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}
}		}

void polly_freeContext(PollyGPUContext *Context) {		void polly_freeContext(PollyGPUContext *Context) {
dump_function();		dump_function();

if (CacheMode)		if (CacheMode)
return;		return;

if (Context->Cuda) {		switch (Runtime) {
CuCtxDestroyFcnPtr(Context->Cuda);		#ifdef HAS_LIBCUDART
free(Context);		case RUNTIME_CUDA:
		freeContextCUDA(Context);
		break;
		#endif /* HAS_LIBCUDART */
		#ifdef HAS_LIBOPENCL
		case RUNTIME_CL:
		freeContextCL(Context);
		break;
		#endif /* HAS_LIBOPENCL */
		default:
		err_runtime();
		}
}		}

dlclose(HandleCuda);		/* Initialize GPUJIT with CUDA as runtime library. */
dlclose(HandleCudaRT);		PollyGPUContext *polly_initContextCUDA() {
		#ifdef HAS_LIBCUDART
		Runtime = RUNTIME_CUDA;
		return polly_initContext();
		#else
		fprintf(stderr, "GPU Runtime was built without CUDA support.\n");
		exit(-1);
		#endif /* HAS_LIBCUDART */
		}

		/* Initialize GPUJIT with OpenCL as runtime library. */
		PollyGPUContext *polly_initContextCL() {
		#ifdef HAS_LIBOPENCL
		Runtime = RUNTIME_CL;
		return polly_initContext();
		#else
		fprintf(stderr, "GPU Runtime was built without OpenCL support.\n");
		exit(-1);
		#endif /* HAS_LIBOPENCL */
}		}
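
Every public polly_* entry point in the new column follows the same pattern: an init call records the chosen backend in a file-level Runtime variable, and each later call switches on it to reach the matching static helper. The sketch below isolates that pattern under stated assumptions. The demo_* names, the integer return codes, and the RUNTIME_NONE value are hypothetical and not part of GPUJIT.c. The unknown-runtime case returns -1 here so the behaviour can be observed, whereas the diff calls err_runtime() instead.

```c
#include <stdio.h>

/* Hypothetical backend selector, modelled on the Runtime variable in the diff.
 * RUNTIME_NONE is an assumption added for this sketch. */
typedef enum { RUNTIME_NONE, RUNTIME_CUDA, RUNTIME_CL } DemoRuntime;

static DemoRuntime Runtime = RUNTIME_NONE;

/* Stand-ins for backend-specific helpers such as synchronizeDeviceCUDA().
 * They return a code instead of talking to a real device. */
static int demo_synchronizeCUDA(void) { return 1; }
static int demo_synchronizeCL(void) { return 2; }

/* Public entry point: forwards to whichever backend was selected at init.
 * Returns -1 when no backend has been selected (assumption for this sketch). */
int demo_synchronize(void) {
  switch (Runtime) {
  case RUNTIME_CUDA:
    return demo_synchronizeCUDA();
  case RUNTIME_CL:
    return demo_synchronizeCL();
  default:
    fprintf(stderr, "No GPU runtime selected.\n");
    return -1;
  }
}

/* Init wrappers: each one only records the backend choice. */
void demo_initCUDA(void) { Runtime = RUNTIME_CUDA; }
void demo_initCL(void) { Runtime = RUNTIME_CL; }
```

Calling demo_initCL() before demo_synchronize() routes the call to the OpenCL stand-in. In the actual patch the case labels are additionally wrapped in HAS_LIBCUDART / HAS_LIBOPENCL guards, so a backend that was not compiled in falls through to the default branch.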

Revision Contents

Diff 97804

CMakeLists.txt

include/polly/CodeGen/PPCGCodeGeneration.h

include/polly/LinkAllPasses.h

lib/CodeGen/PPCGCodeGeneration.cpp

lib/Support/RegisterPasses.cpp

test/GPGPU/cuda-managed-memory-simple.ll

test/GPGPU/size-cast.ll

tools/CMakeLists.txt

tools/GPURuntime/GPUJIT.h

tools/GPURuntime/GPUJIT.c
