This is an archive of the discontinued LLVM Phabricator instance.

Differential D35185

[Polly][GPGPU] Added SPIR Code Generation and Corresponding Runtime Support for Intel
Closed, Public

Authored by PhilippSchaad on Jul 9 2017, 11:51 AM.

Details

Reviewers

bollu
grosser
Meinersbur
singam-sanjay

Commits

rG2f3073b5cb88: [Polly][GPGPU] Added SPIR Code Generation and Corresponding Runtime Support for…
rPLO308751: [Polly][GPGPU] Added SPIR Code Generation and Corresponding Runtime Support for…
rL308751: [Polly][GPGPU] Added SPIR Code Generation and Corresponding Runtime Support for…

Summary

Added SPIR Code Generation to the PPCG Code Generator. This can be invoked using
the polly-gpu-arch flag value 'spir32' or 'spir64' for 32 and 64 bit code respectively.
In addition to that, runtime support has been added to execute said SPIR code on Intel
GPUs, where the system is equipped with Intel's open source driver Beignet (development
version). This requires the cmake flag 'USE_INTEL_OCL' to be turned on, and the polly-gpu-runtime
flag value to be 'libopencl'.
The transformation of LLVM IR to SPIR is currently quite a hack, consisting in part of regex
string transformations.
Has been tested (working) with Polybench 3.2 on an Intel i7-5500U (integrated graphics chip).
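
As a rough illustration of the flag values named in the summary, the following standalone sketch maps a -polly-gpu-arch value onto the GPUArch enum that this revision extends. parseGPUArch and pointerBits are hypothetical helpers, not part of the patch, and the value "nvptx64" for the existing target is an assumption; only 'spir32' and 'spir64' are taken from the summary.

```cpp
#include <cassert>
#include <string>

// Mirrors the enum extended by this revision in PPCGCodeGeneration.h.
enum GPUArch { NVPTX64, SPIR32, SPIR64 };

// Hypothetical helper: translate a -polly-gpu-arch value into the enum.
// Returns false for unknown values and leaves Out untouched.
bool parseGPUArch(const std::string &Value, GPUArch &Out) {
  if (Value == "nvptx64") { // assumed spelling of the existing NVPTX value
    Out = NVPTX64;
    return true;
  }
  if (Value == "spir32") {
    Out = SPIR32;
    return true;
  }
  if (Value == "spir64") {
    Out = SPIR64;
    return true;
  }
  return false;
}

// Pointer width in bits implied by the selected architecture.
unsigned pointerBits(GPUArch Arch) { return Arch == SPIR32 ? 32u : 64u; }
```

In Polly itself the mapping is presumably done by an LLVM command-line option in RegisterPasses.cpp rather than by hand-written string comparison.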

Diff Detail

Repository: rL LLVM

Event Timeline

PhilippSchaad created this revision. Jul 9 2017, 11:51 AM

Herald added a reviewer: bollu. Jul 9 2017, 11:51 AM

Herald added subscribers: kbarton, Anastasia, mgorny, nemanjai.

Harbormaster completed remote builds in B8083: Diff 105789. Jul 9 2017, 11:51 AM

PhilippSchaad added reviewers: grosser, Meinersbur. Jul 9 2017, 11:52 AM

PhilippSchaad set the repository for this revision to rL LLVM.

PhilippSchaad added a project: Restricted Project.

PhilippSchaad edited the summary of this revision. (Show Details)

PhilippSchaad added a subscriber: pollydev.

Hi Philipp,

this looks indeed already surprisingly good. As you noted, there are still some hacks, but I have good hopes that we can work around most of them. I added some comments. Let me know what you think!

Best,
Tobias

lib/CodeGen/PPCGCodeGeneration.cpp
40 ↗	(On Diff #105789)	After having removed the hack below, this should hopefully not be needed any more.
1247 ↗	(On Diff #105789)	Why do you emit NVIDIA intrinsics for SPIR code? Can you not emit an opencl barrier? If you don't know what intrinsic corresponds to this, create a simple .cl file, compile it with clang to SPIR and see which intrinsic gets emitted.
1770 ↗	(On Diff #105789)	Instead of spreading the SPIR code all over the place, can we create one function addSPIRMetadata, that takes 'Args' as argument, emits all the metadata needed and also emits the metadata you add a little further below?
1870 ↗	(On Diff #105789)	Why do you emit NVIDIA intrinsics? As discussed above, is there no way to emit the OpenCL intrinsics instead?
2169 ↗	(On Diff #105789)	It seems all these regexp hacks are needed because we don't generate the right intrinsics. What is holding us back from generating them directly?
2197 ↗	(On Diff #105789)	Interesting. So the optimization breaks stuff. Would be great to understand what exactly we break!
tools/GPURuntime/GPUJIT.c
159 ↗	(On Diff #105789)	Why do these declarations need to be optional? I think they should compile easily in any situation, no?
230 ↗	(On Diff #105789)	I am surprised. Intel's OpenCL uses such a different path. In fact, I would assume that libOpenCL.so as a generic OpenCL library would take care of loading beignet. Is this not the case?
289 ↗	(On Diff #105789)	Does this break or does this just return nullptr if the function is not available. If we can always try and just detect if it is not around, this would be great!
521 ↗	(On Diff #105789)	Why do we read the file from "kernel.ll" and not use the file passed to us in BinaryBuffer?

Set this to request changes, so that I get notified in case you update this revision.

This revision now requires changes to proceed. Jul 10 2017, 1:51 PM

PhilippSchaad added inline comments. Jul 10 2017, 4:06 PM

lib/CodeGen/PPCGCodeGeneration.cpp
40 ↗	(On Diff #105789)	That is correct, yes. This should only be needed as long as we have to manually change the string occurrences for barriers and global/local IDs.
1247 ↗	(On Diff #105789)	The generated intrinsics are the ones declared in the SPIR reference, so for example @_Z13get_global_idj(i32 0) . I cannot find any corresponding intrinsics in LLVM's TableGen/Intrinsics, did I overlook some? This is essentially the only real point where the hack is needed. If we can insert correct intrinsics, we can ditch the regex. Because fixing this is all the regex does.
1770 ↗	(On Diff #105789)	Will look into that.
1870 ↗	(On Diff #105789)	Ditto above.
2169 ↗	(On Diff #105789)	Ditto above.
2197 ↗	(On Diff #105789)	Looking into that. What is clear, is that the optimized version skips a few 'dummy' computations, because the optimized kernels expect a smaller workgroup size in OpenCL. Not sure if and how we could account for that. It seems very tricky to me atm, have played around with it quite a bit.
tools/GPURuntime/GPUJIT.c
159 ↗	(On Diff #105789)	Oh yeah, my bad. Only the LLVMIntel one has to be conditional.
230 ↗	(On Diff #105789)	The funny thing is: it does, somewhat. If I only dlopen libOpenCL and then call for example clCreateProgramWithBinary, looking at that in gdb reveals that the beignet routines get loaded by libOpenCL. However as soon as a function is not present in the standard OpenCL API, eg. clCreateProgramWithLLVMIntel, it does not like that.
289 ↗	(On Diff #105789)	I'm gonna look into that, but iirc it didn't work. Might be wrong, will try it out again!
521 ↗	(On Diff #105789)	Well, the problem is that clCreateProgramWithLLVMIntel wants to read from a file, and requests a filename/filepath as an argument there. BinaryBuffer, however, holds the complete binary. If you know a simpler solution, that would of course be great, because this is another hacky part.
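
For readers unfamiliar with names such as @_Z13get_global_idj: this follows the Itanium C++ name-mangling scheme for a free function, i.e. "_Z", the length of the name, the name itself, and one type code per parameter ("j" for unsigned int). The helper below is a hypothetical sketch that only covers this simple case (builtin parameter types, no namespaces); it is not code from the patch.

```cpp
#include <cassert>
#include <string>

// Mangle a free function with builtin parameter types the Itanium way:
// "_Z" + <length of name> + <name> + <one type code per parameter>.
// ParamCodes holds the ABI's builtin type codes, e.g. "j" for unsigned int.
std::string mangleSimple(const std::string &Name,
                         const std::string &ParamCodes) {
  assert(!Name.empty() && "function name must not be empty");
  // A function without parameters is encoded with "v" (void).
  const std::string Params = ParamCodes.empty() ? "v" : ParamCodes;
  return "_Z" + std::to_string(Name.size()) + Name + Params;
}
```

Under these assumptions, mangleSimple("get_global_id", "j") reproduces the name quoted in the comment above.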

PhilippSchaad updated this revision to Diff 105946. Jul 10 2017, 5:29 PM

PhilippSchaad edited edge metadata.

Adapted dynamic method loading for intel

grosser added inline comments. Jul 10 2017, 10:08 PM

CMakeLists.txt
178 ↗	(On Diff #105946)	Forgot to mention, but would be great if we could get away without a configure test here.
lib/CodeGen/PPCGCodeGeneration.cpp
1247 ↗	(On Diff #105789)	No, I think they may not exist. However, you can just create declarations for such functions using code as in "GPUNodeBuilder::createCallGetKernel(".
1770 ↗	(On Diff #105789)	I just see this is not only about the llvm::Type, but also carries an integer which is 0 for local/shared memory and 1 otherwise. So moving this into a function does not seem so easy. If you don't find an easy way, just leave it as it is.
2197 ↗	(On Diff #105789)	It's OK. Let's get the remainder of the patch updated, then we can look at this issue later on.
tools/GPURuntime/GPUJIT.c
521 ↗	(On Diff #105789)	I see. Maybe that's fine then. Just add a comment for now. I would prefer if we could use clCreateProgramWithBinary and just pass an LLVM bitcode file, but this likely only works if the LLVM version of beignet is very modern. Let's leave it for now like this. It's good to have something working!
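
The per-argument integer mentioned in the comment on line 1770 can be pictured with the following standalone sketch. It follows the committed diff further down, where createKernelFunctionDecl records 1 for array pointers into global memory and 0 for every argument passed by value. ArgKind and kernelArgAddrSpaces are hypothetical names used only for illustration; the real code builds LLVM metadata nodes instead of a vector of ints.

```cpp
#include <cassert>
#include <vector>

// Hypothetical classification of kernel arguments, in the order in which
// createKernelFunctionDecl appends them in this revision.
enum class ArgKind {
  ReadOnlyScalar,
  ArrayPointer,
  HostIterator,
  Parameter,
  SubtreeValue
};

// Address-space number recorded under "kernel_arg_addr_space" per argument:
// 1 for pointers into global memory, 0 for everything passed by value.
std::vector<int> kernelArgAddrSpaces(const std::vector<ArgKind> &Args) {
  std::vector<int> Spaces;
  Spaces.reserve(Args.size());
  for (ArgKind K : Args)
    Spaces.push_back(K == ArgKind::ArrayPointer ? 1 : 0);
  return Spaces;
}
```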

grosser requested changes to this revision. Jul 10 2017, 10:15 PM

This revision now requires changes to proceed. Jul 10 2017, 10:15 PM

Inserting SPIR barriers with custom function.

CMakeLists.txt
178 ↗	(On Diff #105946)	Yes I agree. I don't know of a method for detecting a Beignet installation with CMake yet, but if that exists, that would of course be great.
tools/GPURuntime/GPUJIT.c
521 ↗	(On Diff #105789)	Totally agreed, that would be the better option. As you pointed out, it is currently not yet possible.

Added comment clarification for workaround.

Removed Regexp hack

PhilippSchaad marked 6 inline comments as done. Jul 11 2017, 4:58 AM

Nice. Can you potentially also add a test case?

CMakeLists.txt
178 ↗	(On Diff #105946)	I don't think we want to add this to cmake at all. At best, we just compile the beignet support in unconditionally, if this is possible.
lib/CodeGen/PPCGCodeGeneration.cpp
1947 ↗	(On Diff #106011)	If you store the group names in an array, you can just index this array with "i".
2126 ↗	(On Diff #106011)	Maybe call this insertCUDAKernelIntrinsics and insertSPIRKernelIntrinsics for consistency?
tools/GPURuntime/GPUJIT.c
230 ↗	(On Diff #105789)	OK. Can you then unconditionally try to load beignet/libcl.so into another variable, e.g. HandleOpenCLBeignet? I assume this handle should be nullptr if the dlopen fails, right? Can you then check if this handle is zero and use this to decide if you want to call clCreateProgramWithLLVMIntel? This should allow us to get rid of all the conditional compilation.

grosser requested changes to this revision. Jul 11 2017, 6:26 AM

This revision now requires changes to proceed. Jul 11 2017, 6:26 AM

grosser added inline comments. Jul 11 2017, 6:29 AM

tools/GPURuntime/GPUJIT.c
521 ↗	(On Diff #105789)	Another change. Could you possibly remove the conditional compilation here and just check if clCreateProgramWithLLVMIntelFcnPtr != nullptr?
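
The runtime checks requested in the comments above (test the Beignet handle and the Intel-specific function pointer for null instead of using #ifdef) could look roughly like this. It is a sketch of the selection logic only: the function pointers are plain stand-ins for what the runtime would obtain via dlsym, and LoadPath, CreateProgramFn, selectLoadPath and selectHandle are hypothetical names, not identifiers from GPUJIT.c.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in signature for the two program-creation entry points. The real
// OpenCL functions take more arguments; only null-ness matters here.
typedef int (*CreateProgramFn)(const char *Source);

enum class LoadPath { IntelLLVM, GenericBinary, Unavailable };

// Prefer the Intel-specific entry point when it was resolved at runtime,
// otherwise fall back to the generic one. No conditional compilation needed.
LoadPath selectLoadPath(CreateProgramFn IntelFn, CreateProgramFn GenericFn) {
  if (IntelFn != nullptr)
    return LoadPath::IntelLLVM;
  if (GenericFn != nullptr)
    return LoadPath::GenericBinary;
  return LoadPath::Unavailable;
}

// Same idea for library handles: use the Beignet handle only if loading it
// succeeded, i.e. the handle is non-null.
void *selectHandle(void *HandleOpenCLBeignet, void *HandleOpenCL) {
  return HandleOpenCLBeignet != nullptr ? HandleOpenCLBeignet : HandleOpenCL;
}
```

The benefit of this shape is that a single binary works on machines with and without Beignet installed.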

Simplification
Removed compile-conditional CMake dependency

I am not yet too familiar with how to implement such a case and what exactly to check for, but if that is desired, I can look into it.

This should now address all concerns, minus the file-workaround.

lib/CodeGen/PPCGCodeGeneration.cpp
2126 ↗	(On Diff #106011)	The reason I kept this separated the way it is now, without specifying "CUDA" in the earlier version, is that SPIR should be the only example. When AMD support follows, or anything where there's a registered LLVM target with corresponding intrinsics (AMDGPU in that case), the original function could be used with just an additional switch case, changing the intrinsics IDs. Do you agree with that?

PhilippSchaad added inline comments. Jul 11 2017, 7:54 AM

lib/CodeGen/PPCGCodeGeneration.cpp
2126 ↗	(On Diff #106011)	And with 'example' I mean 'exception', my bad.

Great. Yes, a test case would be great. It is not very complicated. You already updated test cases in your last patch. I would suggest you copy one of the simple double nested loop kernels from the PTX test and just check that the right SPIR intrinsics are generated! That should be enough.

Otherwise, I think this is good to go. Very nice work!

lib/CodeGen/PPCGCodeGeneration.cpp
2126 ↗	(On Diff #106011)	Sure. Still, the function names could be more similar. What about insertKernelCallsSPIR?

This revision now requires changes to proceed. Jul 11 2017, 8:47 AM

Added testcase for SPIR
Fixed naming inconsistencies

LGTM from my side.

This revision is now accepted and ready to land. Jul 14 2017, 11:55 AM

@singam-sanjay can you have a final look and commit it if you are OK with this patch.

Sure. Will do in some time.

@grosser Thanks for adding me as a reviewer ! It's helped me acquaint myself with the SPIR V codegen and understand GPUJIT better.

@PhilippSchaad I've made suggestions to restructure code, pointed out a possibly unintentional debug print statement, and added some questions about the SPIR. Do let me know if (and why) some of my suggestions are unnecessary.

General Question: I was following a discussion about integrating the SPIR-V backend as a part of LLVM, i.e. moving it from /tools to lib/Target. If that were to happen, would we be able to generate and feed SPIR IR in the same way as NVPTX ?

lib/CodeGen/PPCGCodeGeneration.cpp
1792 ↗	(On Diff #106663)	@PhilippSchaad I'm not yet acquainted with the SPIR annotations. Could you please explain why it matters to let the SPIR backend know whether a kernel argument is read-only ?
1794 ↗	(On Diff #106663)	Why are you setting these metadata to empty strings ? Is this a feature yet to be implemented ?
2140 ↗	(On Diff #106663)	nitpick: Do we need this ?
2163 ↗	(On Diff #106663)	nitpick: Do we need this ?
2214 ↗	(On Diff #106663)	Would it be better to move the SPIR specific code into createKernelASM, as well ? You might not need the switch cases to handle buggy control flow like, case GPUArch::SPIR64: case GPUArch::SPIR32: llvm_unreachable("Cannot generate ASM for SPIR architecture");
tools/GPURuntime/GPUJIT.c
148 ↗	(On Diff #106663)	This is new to me. Why are you providing runtime support for Intel to JIT SPIR code ? Is it to run with the integrated GPU on an Intel CPU? Guessing from https://github.com/intel/beignet.
249 ↗	(On Diff #106663)	`void *Handle = ( HandleOpenCLBeignet ? HandleOpenCLBeignet : HandleOpenCL );` maybe ? Let me know if this isn't recommended.
251 ↗	(On Diff #106663)	Is it that we prefer to use libraries provided by Intel's SDK and integrated GPU over other OpenCL drivers and devices ? I'm assuming that this leads the runtime to use the integrated GPU even when a more powerful Radeon card is present. Please correct me if I'm wrong.
508 ↗	(On Diff #106663)	Is this file opened inside <llvm_build>/bin or $PWD ? If it's $PWD, we should take precautions to not open a file that the user needs. You could instead open a file in /tmp or %TEMP% in Windows, like "/tmp/R@ηdηдm3.ll". Anyways, you could implement this later.
514 ↗	(On Diff #106663)	Was this supposed to be a message letting the user know that the runtime's not using the "beignet" OpenCL implementation ?
521 ↗	(On Diff #105789)	@grosser What is your rationale to avoid conditional compilation ?

Will address the points mentioned asap. As for your SPIR-V question: The goal, should that back-end be added, would be to add SPIR-V compilation as an additional GPU target option (In the future). This would allow kernel execution on any SPIR-V supporting target device. This SPIR solution is essentially a 'quickfix' to get it to work on Intel, as long as SPIR-V is not an option.

lib/CodeGen/PPCGCodeGeneration.cpp
1792 ↗	(On Diff #106663)	The issue here is the Intel Beignet driver. Beignet expects a very specific set of kernel annotations when parsing the Kernel IR. The metadata is simply required to be there, but for the kernels we generate here, it does not matter what is actually contained in there. This is entirely Beignet dependent, maybe Intel will adapt that in the future.
1794 ↗	(On Diff #106663)	Ditto above.
2140 ↗	(On Diff #106663)	We can change it to an if-clause if that is desired. The compiler just complains when we have a switch and do not check every possible value of `Arch`. An alternative would be a default clause handling that `llvm_unreachable`.
2163 ↗	(On Diff #106663)	Ditto above.
2214 ↗	(On Diff #106663)	Will do, good point. Originally moved it out because we thought we had to do a lot more hacking on the resulting IR.
tools/GPURuntime/GPUJIT.c
148 ↗	(On Diff #106663)	Your guess is exactly correct. This also corresponds to some of the points mentioned above.
249 ↗	(On Diff #106663)	Definitely the prettier solution. Will change.
251 ↗	(On Diff #106663)	This is actually a good point. We don't want to prefer the Intel provided SDK, the choice of selecting Intel's Beignet should be dependent on what -polly-gpu-arch value has been selected. AMD is in the pipeline.
508 ↗	(On Diff #106663)	I will add this later, but yes it is $PWD. The solution of using /tmp and %TEMP% seems reasonable.
514 ↗	(On Diff #106663)	Oops, thanks for pointing this out, this was unintentionally left in and should never be reached! Used it for debugging. Removing it.
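
A possible shape for the temporary-file approach discussed for line 508, instead of writing "kernel.ll" into $PWD, is sketched below. It is POSIX-only (mkstemp), so the %TEMP% case on Windows is not covered. writeKernelToTempFile and the "polly-kernel-" prefix are hypothetical, and whether clCreateProgramWithLLVMIntel accepts a file name without an ".ll" suffix is an open assumption.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <unistd.h>

// Write Data to a freshly created, uniquely named file in the temporary
// directory and return its path, or an empty string on failure.
std::string writeKernelToTempFile(const std::string &Data) {
  const char *TmpDir = std::getenv("TMPDIR");
  std::string Path = std::string(TmpDir && *TmpDir ? TmpDir : "/tmp") +
                     "/polly-kernel-XXXXXX";

  // mkstemp replaces the trailing XXXXXX in place and creates the file
  // exclusively, so an existing user file is never overwritten.
  int FD = mkstemp(&Path[0]);
  if (FD == -1)
    return std::string();

  size_t Written = 0;
  while (Written < Data.size()) {
    ssize_t N = write(FD, Data.data() + Written, Data.size() - Written);
    if (N <= 0) { // treat any short-circuiting error as failure in this sketch
      close(FD);
      unlink(Path.c_str());
      return std::string();
    }
    Written += static_cast<size_t>(N);
  }
  close(FD);
  return Path;
}
```

The caller would pass the returned path to the file-based entry point and unlink the file afterwards.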

singam-sanjay added inline comments. Jul 17 2017, 12:56 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1792 ↗	(On Diff #106663)	Ohk. Thanks for the info !
1794 ↗	(On Diff #106663)	Alright. I guess beignet isn't worried about the names of kernel arguments in "kernel_arg_name".
2140 ↗	(On Diff #106663)	I was talking about the `break;` after `llvm_unreachable`. If `llvm_unreachable` would call `abort`, the `break` would be redundant.
2163 ↗	(On Diff #106663)	Ditto above ;-).
2214 ↗	(On Diff #106663)	Alright.
tools/GPURuntime/GPUJIT.c
148 ↗	(On Diff #106663)	Ohk.
249 ↗	(On Diff #106663)	`void *Handle = ( HandleOpenCLBeignet!=NULL ? HandleOpenCLBeignet : HandleOpenCL );` would be better.
251 ↗	(On Diff #106663)	Alright. Any ideas on how you'd differentiate between the integrated and dedicated GPU at command line ?
508 ↗	(On Diff #106663)	Alright.
514 ↗	(On Diff #106663)	👍
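
The point about the redundant `break;` (lines 2140 and 2163) can be illustrated with a standalone sketch. The unreachable function below is a stand-in for llvm_unreachable so the example compiles without LLVM headers; nvvmThreadIdName is a hypothetical helper. Since the call never returns, no statement after it in the same case can execute, and listing every enumerator without a default keeps the compiler's switch-coverage warning useful.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

enum class GPUArch { NVPTX64, SPIR32, SPIR64 };

// Stand-in for llvm_unreachable: report the message and abort.
[[noreturn]] void unreachable(const char *Msg) {
  std::fprintf(stderr, "UNREACHABLE: %s\n", Msg);
  std::abort();
}

// IR name of the NVVM thread-id intrinsic for dimension Dim (0, 1 or 2).
std::string nvvmThreadIdName(GPUArch Arch, unsigned Dim) {
  static const char *const Names[3] = {"llvm.nvvm.read.ptx.sreg.tid.x",
                                       "llvm.nvvm.read.ptx.sreg.tid.y",
                                       "llvm.nvvm.read.ptx.sreg.tid.z"};
  assert(Dim < 3 && "at most three thread dimensions");
  switch (Arch) {
  case GPUArch::SPIR64:
  case GPUArch::SPIR32:
    unreachable("Cannot generate NVVM intrinsics for SPIR");
    // No break needed: control never continues past the call above.
  case GPUArch::NVPTX64:
    return Names[Dim];
  }
  unreachable("unhandled GPUArch value");
}
```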

Removed left over debug print, moved SPIR creation into createASM, fixed minor issues addressed in comments.

PhilippSchaad marked 24 inline comments as done. Jul 17 2017, 5:25 PM

PhilippSchaad added inline comments.

tools/GPURuntime/GPUJIT.c
251 ↗	(On Diff #106663)	The goal is to run on AMD GPUs with the AMDGPU backend, which would mean the differentiation would be done with -polly-gpu-arch=spir32/amdgcn32 for example. I am not yet sure how to transport that choice to here, but I will look into that as I am working on the AMD side of things. Thanks for pointing out this flaw.

Ping: do the latest changes address your concerns @singam-sanjay ? Can I land this?

@PhilippSchaad Please go ahead.

Rebase for commit

Harbormaster completed remote builds in B8468: Diff 107678. Jul 21 2017, 8:52 AM

Closed by commit rL308751: [Polly][GPGPU] Added SPIR Code Generation and Corresponding Runtime Support for… (authored by phschaad). Jul 21 2017, 9:11 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
polly/trunk/include/polly/CodeGen/PPCGCodeGeneration.h	2 lines
polly/trunk/lib/CodeGen/PPCGCodeGeneration.cpp	169 lines
polly/trunk/lib/Support/RegisterPasses.cpp	6 lines
polly/trunk/test/GPGPU/spir-codegen.ll	118 lines
polly/trunk/tools/GPURuntime/GPUJIT.c	101 lines

Diff 107681

polly/trunk/include/polly/CodeGen/PPCGCodeGeneration.h

	[10 lines not shown]
	// GPU mapping strategy.			// GPU mapping strategy.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef POLLY_PPCGCODEGENERATION_H			#ifndef POLLY_PPCGCODEGENERATION_H
	#define POLLY_PPCGCODEGENERATION_H			#define POLLY_PPCGCODEGENERATION_H

	/// The GPU Architecture to target.			/// The GPU Architecture to target.
	enum GPUArch { NVPTX64 };			enum GPUArch { NVPTX64, SPIR32, SPIR64 };

	/// The GPU Runtime implementation to use.			/// The GPU Runtime implementation to use.
	enum GPURuntime { CUDA, OpenCL };			enum GPURuntime { CUDA, OpenCL };

	#endif // POLLY_PPCGCODEGENERATION_H			#endif // POLLY_PPCGCODEGENERATION_H

polly/trunk/lib/CodeGen/PPCGCodeGeneration.cpp

[539 lines not shown]	private:
Function *createKernelFunctionDecl(ppcg_kernel *Kernel,		Function *createKernelFunctionDecl(ppcg_kernel *Kernel,
SetVector<Value *> &SubtreeValues);		SetVector<Value *> &SubtreeValues);

/// Insert intrinsic functions to obtain thread and block ids.		/// Insert intrinsic functions to obtain thread and block ids.
///		///
/// @param The kernel to generate the intrinsic functions for.		/// @param The kernel to generate the intrinsic functions for.
void insertKernelIntrinsics(ppcg_kernel *Kernel);		void insertKernelIntrinsics(ppcg_kernel *Kernel);

		/// Insert function calls to retrieve the SPIR group/local ids.
		///
		/// @param The kernel to generate the function calls for.
		void insertKernelCallsSPIR(ppcg_kernel *Kernel);

/// Setup the creation of functions referenced by the GPU kernel.		/// Setup the creation of functions referenced by the GPU kernel.
///		///
/// 1. Create new function declarations in GPUModule which are the same as		/// 1. Create new function declarations in GPUModule which are the same as
/// SubtreeFunctions.		/// SubtreeFunctions.
///		///
/// 2. Populate IslNodeBuilder::ValueMap with mappings from		/// 2. Populate IslNodeBuilder::ValueMap with mappings from
/// old functions (that come from the original module) to new functions		/// old functions (that come from the original module) to new functions
/// (that are created within GPUModule). That way, we generate references		/// (that are created within GPUModule). That way, we generate references
[693 lines not shown]	void GPUNodeBuilder::createScopStmt(isl_ast_expr *Expr,
if (Stmt->isBlockStmt())		if (Stmt->isBlockStmt())
BlockGen.copyStmt(*Stmt, LTS, Indexes);		BlockGen.copyStmt(*Stmt, LTS, Indexes);
else		else
RegionGen.copyStmt(*Stmt, LTS, Indexes);		RegionGen.copyStmt(*Stmt, LTS, Indexes);
}		}

void GPUNodeBuilder::createKernelSync() {		void GPUNodeBuilder::createKernelSync() {
Module *M = Builder.GetInsertBlock()->getParent()->getParent();		Module *M = Builder.GetInsertBlock()->getParent()->getParent();
		const char *SpirName = "__gen_ocl_barrier_global";

Function *Sync;		Function *Sync;

switch (Arch) {		switch (Arch) {
		case GPUArch::SPIR64:
		case GPUArch::SPIR32:
		Sync = M->getFunction(SpirName);

		// If Sync is not available, declare it.
		if (!Sync) {
		GlobalValue::LinkageTypes Linkage = Function::ExternalLinkage;
		std::vector<Type *> Args;
		FunctionType *Ty = FunctionType::get(Builder.getVoidTy(), Args, false);
		Sync = Function::Create(Ty, Linkage, SpirName, M);
		Sync->setCallingConv(CallingConv::SPIR_FUNC);
		}
		break;
case GPUArch::NVPTX64:		case GPUArch::NVPTX64:
Sync = Intrinsic::getDeclaration(M, Intrinsic::nvvm_barrier0);		Sync = Intrinsic::getDeclaration(M, Intrinsic::nvvm_barrier0);
break;		break;
}		}

Builder.CreateCall(Sync, {});		Builder.CreateCall(Sync, {});
}		}

[394 lines not shown]	void GPUNodeBuilder::createKernel(__isl_take isl_ast_node *KernelStmt) {

createKernelFunction(Kernel, SubtreeValues, SubtreeFunctions);		createKernelFunction(Kernel, SubtreeValues, SubtreeFunctions);
setupKernelSubtreeFunctions(SubtreeFunctions);		setupKernelSubtreeFunctions(SubtreeFunctions);

create(isl_ast_node_copy(Kernel->tree));		create(isl_ast_node_copy(Kernel->tree));

finalizeKernelArguments(Kernel);		finalizeKernelArguments(Kernel);
Function *F = Builder.GetInsertBlock()->getParent();		Function *F = Builder.GetInsertBlock()->getParent();
		if (Arch == GPUArch::NVPTX64)
addCUDAAnnotations(F->getParent(), BlockDimX, BlockDimY, BlockDimZ);		addCUDAAnnotations(F->getParent(), BlockDimX, BlockDimY, BlockDimZ);
clearDominators(F);		clearDominators(F);
clearScalarEvolution(F);		clearScalarEvolution(F);
clearLoops(F);		clearLoops(F);

IDToValue = HostIDs;		IDToValue = HostIDs;

ValueMap = std::move(HostValueMap);		ValueMap = std::move(HostValueMap);
ScalarMap = std::move(HostScalarMap);		ScalarMap = std::move(HostScalarMap);
[40 lines not shown]	if (!is64Bit) {
Ret += "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:"		Ret += "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:"
"64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:"		"64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:"
"64-v128:128:128-n16:32:64";		"64-v128:128:128-n16:32:64";
}		}

return Ret;		return Ret;
}		}

		/// Compute the DataLayout string for a SPIR kernel.
		///
		/// @param is64Bit Are we looking for a 64 bit architecture?
		static std::string computeSPIRDataLayout(bool is64Bit) {
		std::string Ret = "";

		if (!is64Bit) {
		Ret += "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:"
		"64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:"
		"32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:"
		"256:256-v256:256:256-v512:512:512-v1024:1024:1024";
		} else {
		Ret += "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:"
		"64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:"
		"32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:"
		"256:256-v256:256:256-v512:512:512-v1024:1024:1024";
		}

		return Ret;
		}

Function *		Function *
GPUNodeBuilder::createKernelFunctionDecl(ppcg_kernel *Kernel,		GPUNodeBuilder::createKernelFunctionDecl(ppcg_kernel *Kernel,
SetVector<Value *> &SubtreeValues) {		SetVector<Value *> &SubtreeValues) {
std::vector<Type *> Args;		std::vector<Type *> Args;
std::string Identifier = getKernelFuncName(Kernel->id);		std::string Identifier = getKernelFuncName(Kernel->id);

		std::vector<Metadata *> MemoryType;

for (long i = 0; i < Prog->n_array; i++) {		for (long i = 0; i < Prog->n_array; i++) {
if (!ppcg_kernel_requires_array_argument(Kernel, i))		if (!ppcg_kernel_requires_array_argument(Kernel, i))
continue;		continue;

if (gpu_array_is_read_only_scalar(&Prog->array[i])) {		if (gpu_array_is_read_only_scalar(&Prog->array[i])) {
isl_id *Id = isl_space_get_tuple_id(Prog->array[i].space, isl_dim_set);		isl_id *Id = isl_space_get_tuple_id(Prog->array[i].space, isl_dim_set);
const ScopArrayInfo *SAI = ScopArrayInfo::getFromId(Id);		const ScopArrayInfo *SAI = ScopArrayInfo::getFromId(Id);
Args.push_back(SAI->getElementType());		Args.push_back(SAI->getElementType());
		MemoryType.push_back(
		ConstantAsMetadata::get(ConstantInt::get(Builder.getInt32Ty(), 0)));
} else {		} else {
static const int UseGlobalMemory = 1;		static const int UseGlobalMemory = 1;
Args.push_back(Builder.getInt8PtrTy(UseGlobalMemory));		Args.push_back(Builder.getInt8PtrTy(UseGlobalMemory));
		MemoryType.push_back(
		ConstantAsMetadata::get(ConstantInt::get(Builder.getInt32Ty(), 1)));
}		}
}		}

int NumHostIters = isl_space_dim(Kernel->space, isl_dim_set);		int NumHostIters = isl_space_dim(Kernel->space, isl_dim_set);

for (long i = 0; i < NumHostIters; i++)		for (long i = 0; i < NumHostIters; i++) {
Args.push_back(Builder.getInt64Ty());		Args.push_back(Builder.getInt64Ty());
		MemoryType.push_back(
		ConstantAsMetadata::get(ConstantInt::get(Builder.getInt32Ty(), 0)));
		}

int NumVars = isl_space_dim(Kernel->space, isl_dim_param);		int NumVars = isl_space_dim(Kernel->space, isl_dim_param);

for (long i = 0; i < NumVars; i++) {		for (long i = 0; i < NumVars; i++) {
isl_id *Id = isl_space_get_dim_id(Kernel->space, isl_dim_param, i);		isl_id *Id = isl_space_get_dim_id(Kernel->space, isl_dim_param, i);
Value *Val = IDToValue[Id];		Value *Val = IDToValue[Id];
isl_id_free(Id);		isl_id_free(Id);
Args.push_back(Val->getType());		Args.push_back(Val->getType());
		MemoryType.push_back(
		ConstantAsMetadata::get(ConstantInt::get(Builder.getInt32Ty(), 0)));
}		}

for (auto *V : SubtreeValues)		for (auto *V : SubtreeValues) {
Args.push_back(V->getType());		Args.push_back(V->getType());
		MemoryType.push_back(
		ConstantAsMetadata::get(ConstantInt::get(Builder.getInt32Ty(), 0)));
		}

auto *FT = FunctionType::get(Builder.getVoidTy(), Args, false);		auto *FT = FunctionType::get(Builder.getVoidTy(), Args, false);
auto *FN = Function::Create(FT, Function::ExternalLinkage, Identifier,		auto *FN = Function::Create(FT, Function::ExternalLinkage, Identifier,
GPUModule.get());		GPUModule.get());

		std::vector<Metadata *> EmptyStrings;

		for (unsigned int i = 0; i < MemoryType.size(); i++) {
		EmptyStrings.push_back(MDString::get(FN->getContext(), ""));
		}

		if (Arch == GPUArch::SPIR32 || Arch == GPUArch::SPIR64) {
		FN->setMetadata("kernel_arg_addr_space",
		MDNode::get(FN->getContext(), MemoryType));
		FN->setMetadata("kernel_arg_name",
		MDNode::get(FN->getContext(), EmptyStrings));
		FN->setMetadata("kernel_arg_access_qual",
		MDNode::get(FN->getContext(), EmptyStrings));
		FN->setMetadata("kernel_arg_type",
		MDNode::get(FN->getContext(), EmptyStrings));
		FN->setMetadata("kernel_arg_type_qual",
		MDNode::get(FN->getContext(), EmptyStrings));
		FN->setMetadata("kernel_arg_base_type",
		MDNode::get(FN->getContext(), EmptyStrings));
		}

switch (Arch) {		switch (Arch) {
case GPUArch::NVPTX64:		case GPUArch::NVPTX64:
FN->setCallingConv(CallingConv::PTX_Kernel);		FN->setCallingConv(CallingConv::PTX_Kernel);
break;		break;
		case GPUArch::SPIR32:
		case GPUArch::SPIR64:
		FN->setCallingConv(CallingConv::SPIR_KERNEL);
		break;
}		}

auto Arg = FN->arg_begin();		auto Arg = FN->arg_begin();
for (long i = 0; i < Kernel->n_array; i++) {		for (long i = 0; i < Kernel->n_array; i++) {
if (!ppcg_kernel_requires_array_argument(Kernel, i))		if (!ppcg_kernel_requires_array_argument(Kernel, i))
continue;		continue;

Arg->setName(Kernel->array[i].array->name);		Arg->setName(Kernel->array[i].array->name);
[49 lines not shown]	GPUNodeBuilder::createKernelFunctionDecl(ppcg_kernel *Kernel,
return FN;		return FN;
}		}

void GPUNodeBuilder::insertKernelIntrinsics(ppcg_kernel *Kernel) {		void GPUNodeBuilder::insertKernelIntrinsics(ppcg_kernel *Kernel) {
Intrinsic::ID IntrinsicsBID[2];		Intrinsic::ID IntrinsicsBID[2];
Intrinsic::ID IntrinsicsTID[3];		Intrinsic::ID IntrinsicsTID[3];

switch (Arch) {		switch (Arch) {
		case GPUArch::SPIR64:
		case GPUArch::SPIR32:
		llvm_unreachable("Cannot generate NVVM intrinsics for SPIR");
case GPUArch::NVPTX64:		case GPUArch::NVPTX64:
IntrinsicsBID[0] = Intrinsic::nvvm_read_ptx_sreg_ctaid_x;		IntrinsicsBID[0] = Intrinsic::nvvm_read_ptx_sreg_ctaid_x;
IntrinsicsBID[1] = Intrinsic::nvvm_read_ptx_sreg_ctaid_y;		IntrinsicsBID[1] = Intrinsic::nvvm_read_ptx_sreg_ctaid_y;

IntrinsicsTID[0] = Intrinsic::nvvm_read_ptx_sreg_tid_x;		IntrinsicsTID[0] = Intrinsic::nvvm_read_ptx_sreg_tid_x;
IntrinsicsTID[1] = Intrinsic::nvvm_read_ptx_sreg_tid_y;		IntrinsicsTID[1] = Intrinsic::nvvm_read_ptx_sreg_tid_y;
IntrinsicsTID[2] = Intrinsic::nvvm_read_ptx_sreg_tid_z;		IntrinsicsTID[2] = Intrinsic::nvvm_read_ptx_sreg_tid_z;
break;		break;
[15 lines not shown]	void GPUNodeBuilder::insertKernelIntrinsics(ppcg_kernel *Kernel) {
}		}

for (int i = 0; i < Kernel->n_block; ++i) {		for (int i = 0; i < Kernel->n_block; ++i) {
isl_id *Id = isl_id_list_get_id(Kernel->thread_ids, i);		isl_id *Id = isl_id_list_get_id(Kernel->thread_ids, i);
addId(Id, IntrinsicsTID[i]);		addId(Id, IntrinsicsTID[i]);
}		}
}		}

		void GPUNodeBuilder::insertKernelCallsSPIR(ppcg_kernel *Kernel) {
		const char *GroupName[3] = {"__gen_ocl_get_group_id0",
		"__gen_ocl_get_group_id1",
		"__gen_ocl_get_group_id2"};

		const char *LocalName[3] = {"__gen_ocl_get_local_id0",
		"__gen_ocl_get_local_id1",
		"__gen_ocl_get_local_id2"};

		auto createFunc = [this](const char *Name, __isl_take isl_id *Id) mutable {
		Module *M = Builder.GetInsertBlock()->getParent()->getParent();
		Function *FN = M->getFunction(Name);

		// If FN is not available, declare it.
		if (!FN) {
		GlobalValue::LinkageTypes Linkage = Function::ExternalLinkage;
		std::vector<Type *> Args;
		FunctionType *Ty = FunctionType::get(Builder.getInt32Ty(), Args, false);
		FN = Function::Create(Ty, Linkage, Name, M);
		FN->setCallingConv(CallingConv::SPIR_FUNC);
		}

		Value *Val = Builder.CreateCall(FN, {});
		Val = Builder.CreateIntCast(Val, Builder.getInt64Ty(), false, Name);
		IDToValue[Id] = Val;
		KernelIDs.insert(std::unique_ptr<isl_id, IslIdDeleter>(Id));
		};

		for (int i = 0; i < Kernel->n_grid; ++i)
		createFunc(GroupName[i], isl_id_list_get_id(Kernel->block_ids, i));

		for (int i = 0; i < Kernel->n_block; ++i)
		createFunc(LocalName[i], isl_id_list_get_id(Kernel->thread_ids, i));
		}

void GPUNodeBuilder::prepareKernelArguments(ppcg_kernel *Kernel, Function *FN) {		void GPUNodeBuilder::prepareKernelArguments(ppcg_kernel *Kernel, Function *FN) {
auto Arg = FN->arg_begin();		auto Arg = FN->arg_begin();
for (long i = 0; i < Kernel->n_array; i++) {		for (long i = 0; i < Kernel->n_array; i++) {
if (!ppcg_kernel_requires_array_argument(Kernel, i))		if (!ppcg_kernel_requires_array_argument(Kernel, i))
continue;		continue;

isl_id *Id = isl_space_get_tuple_id(Prog->array[i].space, isl_dim_set);		isl_id *Id = isl_space_get_tuple_id(Prog->array[i].space, isl_dim_set);
const ScopArrayInfo *SAI = ScopArrayInfo::getFromId(isl_id_copy(Id));		const ScopArrayInfo *SAI = ScopArrayInfo::getFromId(isl_id_copy(Id));
[122 lines not shown]	void GPUNodeBuilder::createKernelFunction(
switch (Arch) {		switch (Arch) {
case GPUArch::NVPTX64:		case GPUArch::NVPTX64:
if (Runtime == GPURuntime::CUDA)		if (Runtime == GPURuntime::CUDA)
GPUModule->setTargetTriple(Triple::normalize("nvptx64-nvidia-cuda"));		GPUModule->setTargetTriple(Triple::normalize("nvptx64-nvidia-cuda"));
else if (Runtime == GPURuntime::OpenCL)		else if (Runtime == GPURuntime::OpenCL)
GPUModule->setTargetTriple(Triple::normalize("nvptx64-nvidia-nvcl"));		GPUModule->setTargetTriple(Triple::normalize("nvptx64-nvidia-nvcl"));
GPUModule->setDataLayout(computeNVPTXDataLayout(true /* is64Bit */));		GPUModule->setDataLayout(computeNVPTXDataLayout(true /* is64Bit */));
break;		break;
		case GPUArch::SPIR32:
		GPUModule->setTargetTriple(Triple::normalize("spir-unknown-unknown"));
		GPUModule->setDataLayout(computeSPIRDataLayout(false /* is64Bit */));
		break;
		case GPUArch::SPIR64:
		GPUModule->setTargetTriple(Triple::normalize("spir64-unknown-unknown"));
		GPUModule->setDataLayout(computeSPIRDataLayout(true /* is64Bit */));
		break;
}		}

Function *FN = createKernelFunctionDecl(Kernel, SubtreeValues);		Function *FN = createKernelFunctionDecl(Kernel, SubtreeValues);

BasicBlock *PrevBlock = Builder.GetInsertBlock();		BasicBlock *PrevBlock = Builder.GetInsertBlock();
auto EntryBlock = BasicBlock::Create(Builder.getContext(), "entry", FN);		auto EntryBlock = BasicBlock::Create(Builder.getContext(), "entry", FN);

DT.addNewBlock(EntryBlock, PrevBlock);		DT.addNewBlock(EntryBlock, PrevBlock);

Builder.SetInsertPoint(EntryBlock);		Builder.SetInsertPoint(EntryBlock);
Builder.CreateRetVoid();		Builder.CreateRetVoid();
Builder.SetInsertPoint(EntryBlock, EntryBlock->begin());		Builder.SetInsertPoint(EntryBlock, EntryBlock->begin());

ScopDetection::markFunctionAsInvalid(FN);		ScopDetection::markFunctionAsInvalid(FN);

prepareKernelArguments(Kernel, FN);		prepareKernelArguments(Kernel, FN);
createKernelVariables(Kernel, FN);		createKernelVariables(Kernel, FN);

		switch (Arch) {
		case GPUArch::NVPTX64:
insertKernelIntrinsics(Kernel);		insertKernelIntrinsics(Kernel);
		break;
		case GPUArch::SPIR32:
		case GPUArch::SPIR64:
		insertKernelCallsSPIR(Kernel);
		break;
		}
}		}

std::string GPUNodeBuilder::createKernelASM() {		std::string GPUNodeBuilder::createKernelASM() {
llvm::Triple GPUTriple;		llvm::Triple GPUTriple;

switch (Arch) {		switch (Arch) {
case GPUArch::NVPTX64:		case GPUArch::NVPTX64:
switch (Runtime) {		switch (Runtime) {
case GPURuntime::CUDA:		case GPURuntime::CUDA:
GPUTriple = llvm::Triple(Triple::normalize("nvptx64-nvidia-cuda"));		GPUTriple = llvm::Triple(Triple::normalize("nvptx64-nvidia-cuda"));
break;		break;
case GPURuntime::OpenCL:		case GPURuntime::OpenCL:
GPUTriple = llvm::Triple(Triple::normalize("nvptx64-nvidia-nvcl"));		GPUTriple = llvm::Triple(Triple::normalize("nvptx64-nvidia-nvcl"));
break;		break;
}		}
break;		break;
		case GPUArch::SPIR64:
		case GPUArch::SPIR32:
		std::string SPIRAssembly;
		raw_string_ostream IROstream(SPIRAssembly);
		IROstream << *GPUModule;
		IROstream.flush();
		return SPIRAssembly;
}		}

std::string ErrMsg;		std::string ErrMsg;
auto GPUTarget = TargetRegistry::lookupTarget(GPUTriple.getTriple(), ErrMsg);		auto GPUTarget = TargetRegistry::lookupTarget(GPUTriple.getTriple(), ErrMsg);

if (!GPUTarget) {		if (!GPUTarget) {
errs() << ErrMsg << "\n";		errs() << ErrMsg << "\n";
return "";		return "";
}		}

TargetOptions Options;		TargetOptions Options;
Options.UnsafeFPMath = FastMath;		Options.UnsafeFPMath = FastMath;

std::string subtarget;		std::string subtarget;

switch (Arch) {		switch (Arch) {
case GPUArch::NVPTX64:		case GPUArch::NVPTX64:
subtarget = CudaVersion;		subtarget = CudaVersion;
break;		break;
		case GPUArch::SPIR32:
		case GPUArch::SPIR64:
		llvm_unreachable("No subtarget for SPIR architecture");
}		}

std::unique_ptr<TargetMachine> TargetM(GPUTarget->createTargetMachine(		std::unique_ptr<TargetMachine> TargetM(GPUTarget->createTargetMachine(
GPUTriple.getTriple(), subtarget, "", Options, Optional<Reloc::Model>()));		GPUTriple.getTriple(), subtarget, "", Options, Optional<Reloc::Model>()));

SmallString<0> ASMString;		SmallString<0> ASMString;
raw_svector_ostream ASMStream(ASMString);		raw_svector_ostream ASMStream(ASMString);
llvm::legacy::PassManager PM;		llvm::legacy::PassManager PM;
[...] if (verifyModule(*GPUModule)) {

BuildSuccessful = false;		BuildSuccessful = false;
return "";		return "";
}		}

if (DumpKernelIR)		if (DumpKernelIR)
outs() << *GPUModule << "\n";		outs() << *GPUModule << "\n";

		if (Arch != GPUArch::SPIR32 && Arch != GPUArch::SPIR64) {
// Optimize module.		// Optimize module.
llvm::legacy::PassManager OptPasses;		llvm::legacy::PassManager OptPasses;
PassManagerBuilder PassBuilder;		PassManagerBuilder PassBuilder;
PassBuilder.OptLevel = 3;		PassBuilder.OptLevel = 3;
PassBuilder.SizeLevel = 0;		PassBuilder.SizeLevel = 0;
PassBuilder.populateModulePassManager(OptPasses);		PassBuilder.populateModulePassManager(OptPasses);
OptPasses.run(*GPUModule);		OptPasses.run(*GPUModule);
		}

std::string Assembly = createKernelASM();		std::string Assembly = createKernelASM();

if (DumpKernelASM)		if (DumpKernelASM)
outs() << Assembly << "\n";		outs() << Assembly << "\n";

GPUModule.release();		GPUModule.release();
KernelIDs.clear();		KernelIDs.clear();
[...]

polly/trunk/lib/Support/RegisterPasses.cpp

[...] cl::values(clEnumValN(GPURuntime::CUDA, "libcudart",
"use the CUDA Runtime API"),		"use the CUDA Runtime API"),
clEnumValN(GPURuntime::OpenCL, "libopencl",		clEnumValN(GPURuntime::OpenCL, "libopencl",
"use the OpenCL Runtime API")),		"use the OpenCL Runtime API")),
cl::init(GPURuntime::CUDA), cl::ZeroOrMore, cl::cat(PollyCategory));		cl::init(GPURuntime::CUDA), cl::ZeroOrMore, cl::cat(PollyCategory));

static cl::opt<GPUArch>		static cl::opt<GPUArch>
GPUArchChoice("polly-gpu-arch", cl::desc("The GPU Architecture to target"),		GPUArchChoice("polly-gpu-arch", cl::desc("The GPU Architecture to target"),
cl::values(clEnumValN(GPUArch::NVPTX64, "nvptx64",		cl::values(clEnumValN(GPUArch::NVPTX64, "nvptx64",
"target NVIDIA 64-bit architecture")),		"target NVIDIA 64-bit architecture"),
		clEnumValN(GPUArch::SPIR32, "spir32",
		"target SPIR 32-bit architecture"),
		clEnumValN(GPUArch::SPIR64, "spir64",
		"target SPIR 64-bit architecture")),
cl::init(GPUArch::NVPTX64), cl::ZeroOrMore,		cl::init(GPUArch::NVPTX64), cl::ZeroOrMore,
cl::cat(PollyCategory));		cl::cat(PollyCategory));
#endif		#endif

VectorizerChoice polly::PollyVectorizerChoice;		VectorizerChoice polly::PollyVectorizerChoice;
static cl::opt<polly::VectorizerChoice, true> Vectorizer(		static cl::opt<polly::VectorizerChoice, true> Vectorizer(
"polly-vectorizer", cl::desc("Select the vectorization strategy"),		"polly-vectorizer", cl::desc("Select the vectorization strategy"),
cl::values(		cl::values(
[...]

polly/trunk/test/GPGPU/spir-codegen.ll

				; RUN: opt -O3 -polly -polly-target=gpu \
				; RUN: -polly-gpu-arch=spir32 \
				; RUN: -polly-acc-dump-kernel-ir -polly-process-unprofitable -disable-output < %s | \
				; RUN: FileCheck %s

				; REQUIRES: pollyacc

				; CHECK: target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024"
				; CHECK-NEXT: target triple = "spir-unknown-unknown"

				; CHECK-LABEL: define spir_kernel void @FUNC_double_parallel_loop_SCOP_0_KERNEL_0(i8 addrspace(1)* %MemRef0) #0 !kernel_arg_addr_space !0 !kernel_arg_name !1 !kernel_arg_access_qual !1 !kernel_arg_type !1 !kernel_arg_type_qual !1 !kernel_arg_base_type !1 {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: %0 = call i32 @__gen_ocl_get_group_id0()
				; CHECK-NEXT: %__gen_ocl_get_group_id0 = zext i32 %0 to i64
				; CHECK-NEXT: %1 = call i32 @__gen_ocl_get_group_id1()
				; CHECK-NEXT: %__gen_ocl_get_group_id1 = zext i32 %1 to i64
				; CHECK-NEXT: %2 = call i32 @__gen_ocl_get_local_id0()
				; CHECK-NEXT: %__gen_ocl_get_local_id0 = zext i32 %2 to i64
				; CHECK-NEXT: %3 = call i32 @__gen_ocl_get_local_id1()
				; CHECK-NEXT: %__gen_ocl_get_local_id1 = zext i32 %3 to i64
				; CHECK-NEXT: br label %polly.loop_preheader

				; CHECK-LABEL: polly.loop_exit: ; preds = %polly.stmt.bb5
				; CHECK-NEXT: ret void

				; CHECK-LABEL: polly.loop_header: ; preds = %polly.stmt.bb5, %polly.loop_preheader
				; CHECK-NEXT: %polly.indvar = phi i64 [ 0, %polly.loop_preheader ], [ %polly.indvar_next, %polly.stmt.bb5 ]
				; CHECK-NEXT: %4 = mul nsw i64 32, %__gen_ocl_get_group_id0
				; CHECK-NEXT: %5 = add nsw i64 %4, %__gen_ocl_get_local_id0
				; CHECK-NEXT: %6 = mul nsw i64 32, %__gen_ocl_get_group_id1
				; CHECK-NEXT: %7 = add nsw i64 %6, %__gen_ocl_get_local_id1
				; CHECK-NEXT: %8 = mul nsw i64 16, %polly.indvar
				; CHECK-NEXT: %9 = add nsw i64 %7, %8
				; CHECK-NEXT: br label %polly.stmt.bb5

				; CHECK-LABEL: polly.stmt.bb5: ; preds = %polly.loop_header
				; CHECK-NEXT: %10 = mul i64 %5, %9
				; CHECK-NEXT: %p_tmp6 = sitofp i64 %10 to float
				; CHECK-NEXT: %polly.access.cast.MemRef0 = bitcast i8 addrspace(1)* %MemRef0 to float addrspace(1)*
				; CHECK-NEXT: %11 = mul nsw i64 32, %__gen_ocl_get_group_id0
				; CHECK-NEXT: %12 = add nsw i64 %11, %__gen_ocl_get_local_id0
				; CHECK-NEXT: %polly.access.mul.MemRef0 = mul nsw i64 %12, 1024
				; CHECK-NEXT: %13 = mul nsw i64 32, %__gen_ocl_get_group_id1
				; CHECK-NEXT: %14 = add nsw i64 %13, %__gen_ocl_get_local_id1
				; CHECK-NEXT: %15 = mul nsw i64 16, %polly.indvar
				; CHECK-NEXT: %16 = add nsw i64 %14, %15
				; CHECK-NEXT: %polly.access.add.MemRef0 = add nsw i64 %polly.access.mul.MemRef0, %16
				; CHECK-NEXT: %polly.access.MemRef0 = getelementptr float, float addrspace(1)* %polly.access.cast.MemRef0, i64 %polly.access.add.MemRef0
				; CHECK-NEXT: %tmp8_p_scalar_ = load float, float addrspace(1)* %polly.access.MemRef0, align 4
				; CHECK-NEXT: %p_tmp9 = fadd float %tmp8_p_scalar_, %p_tmp6
				; CHECK-NEXT: %polly.access.cast.MemRef01 = bitcast i8 addrspace(1)* %MemRef0 to float addrspace(1)*
				; CHECK-NEXT: %17 = mul nsw i64 32, %__gen_ocl_get_group_id0
				; CHECK-NEXT: %18 = add nsw i64 %17, %__gen_ocl_get_local_id0
				; CHECK-NEXT: %polly.access.mul.MemRef02 = mul nsw i64 %18, 1024
				; CHECK-NEXT: %19 = mul nsw i64 32, %__gen_ocl_get_group_id1
				; CHECK-NEXT: %20 = add nsw i64 %19, %__gen_ocl_get_local_id1
				; CHECK-NEXT: %21 = mul nsw i64 16, %polly.indvar
				; CHECK-NEXT: %22 = add nsw i64 %20, %21
				; CHECK-NEXT: %polly.access.add.MemRef03 = add nsw i64 %polly.access.mul.MemRef02, %22
				; CHECK-NEXT: %polly.access.MemRef04 = getelementptr float, float addrspace(1)* %polly.access.cast.MemRef01, i64 %polly.access.add.MemRef03
				; CHECK-NEXT: store float %p_tmp9, float addrspace(1)* %polly.access.MemRef04, align 4
				; CHECK-NEXT: %polly.indvar_next = add nsw i64 %polly.indvar, 1
				; CHECK-NEXT: %polly.loop_cond = icmp sle i64 %polly.indvar_next, 1
				; CHECK-NEXT: br i1 %polly.loop_cond, label %polly.loop_header, label %polly.loop_exit

				; CHECK-LABEL: polly.loop_preheader: ; preds = %entry
				; CHECK-NEXT: br label %polly.loop_header

				; CHECK: attributes #0 = { "polly.skip.fn" }

				; void double_parallel_loop(float A[][1024]) {
				; for (long i = 0; i < 1024; i++)
				; for (long j = 0; j < 1024; j++)
				; A[i][j] += i * j;
				; }
				;
				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

				define void @double_parallel_loop([1024 x float]* %A) {
				bb:
				br label %bb2

				bb2: ; preds = %bb13, %bb
				%i.0 = phi i64 [ 0, %bb ], [ %tmp14, %bb13 ]
				%exitcond1 = icmp ne i64 %i.0, 1024
				br i1 %exitcond1, label %bb3, label %bb15

				bb3: ; preds = %bb2
				br label %bb4

				bb4: ; preds = %bb10, %bb3
				%j.0 = phi i64 [ 0, %bb3 ], [ %tmp11, %bb10 ]
				%exitcond = icmp ne i64 %j.0, 1024
				br i1 %exitcond, label %bb5, label %bb12

				bb5: ; preds = %bb4
				%tmp = mul nuw nsw i64 %i.0, %j.0
				%tmp6 = sitofp i64 %tmp to float
				%tmp7 = getelementptr inbounds [1024 x float], [1024 x float]* %A, i64 %i.0, i64 %j.0
				%tmp8 = load float, float* %tmp7, align 4
				%tmp9 = fadd float %tmp8, %tmp6
				store float %tmp9, float* %tmp7, align 4
				br label %bb10

				bb10: ; preds = %bb5
				%tmp11 = add nuw nsw i64 %j.0, 1
				br label %bb4

				bb12: ; preds = %bb4
				br label %bb13

				bb13: ; preds = %bb12
				%tmp14 = add nuw nsw i64 %i.0, 1
				br label %bb2

				bb15: ; preds = %bb2
				ret void
				}
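
The CHECK lines in this test compute the row index as 32 * group_id0 + local_id0 and the column index as 32 * group_id1 + local_id1 + 16 * polly.indvar, with polly.indvar taking the values 0 and 1. The C sketch below replays that mapping on the host and counts how many cells of the 1024 x 1024 array are visited exactly once. The grid size (32 x 32 groups) and block size (32 x 16 work-items) are assumptions chosen to match the constants in the IR; the test does not state them. `count_cells_visited_once` is a hypothetical helper, not part of Polly.

```c
#include <assert.h>
#include <string.h>

/* Assumed launch configuration: 32 x 32 work-groups of 32 x 16 work-items,
 * each work-item running the two-iteration polly loop seen in the IR. */
enum { N = 1024, GRID0 = 32, GRID1 = 32, BLOCK0 = 32, BLOCK1 = 16, STEPS = 2 };

static unsigned char Visits[N][N];

/* Replays the index arithmetic from the CHECK lines and returns the number
 * of array cells that were visited exactly once. */
static long count_cells_visited_once(void) {
  memset(Visits, 0, sizeof(Visits));

  for (long G0 = 0; G0 < GRID0; ++G0)
    for (long G1 = 0; G1 < GRID1; ++G1)
      for (long L0 = 0; L0 < BLOCK0; ++L0)
        for (long L1 = 0; L1 < BLOCK1; ++L1)
          for (long K = 0; K < STEPS; ++K) {
            long I = 32 * G0 + L0;          /* %5 in the IR */
            long J = 32 * G1 + L1 + 16 * K; /* %9 in the IR */
            assert(I >= 0 && I < N && J >= 0 && J < N);
            Visits[I][J]++;
          }

  long Once = 0;
  for (long I = 0; I < N; ++I)
    for (long J = 0; J < N; ++J)
      if (Visits[I][J] == 1)
        ++Once;
  return Once;
}
```

Under these assumptions every one of the 1024 * 1024 cells is covered exactly once, which matches the iteration space of the original `double_parallel_loop`.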

polly/trunk/tools/GPURuntime/GPUJIT.c

[...]
#include <cuda_runtime.h>		#include <cuda_runtime.h>
#endif /* HAS_LIBCUDART */		#endif /* HAS_LIBCUDART */

#ifdef HAS_LIBOPENCL		#ifdef HAS_LIBOPENCL
#ifdef __APPLE__		#ifdef __APPLE__
#include <OpenCL/opencl.h>		#include <OpenCL/opencl.h>
#else		#else
#include <CL/cl.h>		#include <CL/cl.h>
#endif		#endif /* __APPLE__ */
#endif /* HAS_LIBOPENCL */		#endif /* HAS_LIBOPENCL */

#include <dlfcn.h>		#include <dlfcn.h>
#include <stdarg.h>		#include <stdarg.h>
#include <stdio.h>		#include <stdio.h>
#include <string.h>		#include <string.h>
		#include <unistd.h>

static int DebugMode;		static int DebugMode;
static int CacheMode;		static int CacheMode;

static PollyGPURuntime Runtime = RUNTIME_NONE;		static PollyGPURuntime Runtime = RUNTIME_NONE;

static void debug_print(const char *format, ...) {		static void debug_print(const char *format, ...) {
if (!DebugMode)		if (!DebugMode)
[...]
};		};

struct OpenCLDevicePtrT {		struct OpenCLDevicePtrT {
cl_mem MemObj;		cl_mem MemObj;
};		};

/* Dynamic library handles for the OpenCL runtime library. */		/* Dynamic library handles for the OpenCL runtime library. */
static void *HandleOpenCL;		static void *HandleOpenCL;
		static void *HandleOpenCLBeignet;

/* Type-defines of function pointer to OpenCL Runtime API. */		/* Type-defines of function pointer to OpenCL Runtime API. */
typedef cl_int clGetPlatformIDsFcnTy(cl_uint NumEntries,		typedef cl_int clGetPlatformIDsFcnTy(cl_uint NumEntries,
cl_platform_id *Platforms,		cl_platform_id *Platforms,
cl_uint *NumPlatforms);		cl_uint *NumPlatforms);
static clGetPlatformIDsFcnTy *clGetPlatformIDsFcnPtr;		static clGetPlatformIDsFcnTy *clGetPlatformIDsFcnPtr;

typedef cl_int clGetDeviceIDsFcnTy(cl_platform_id Platform,		typedef cl_int clGetDeviceIDsFcnTy(cl_platform_id Platform,
[...]

typedef cl_int		typedef cl_int
clEnqueueWriteBufferFcnTy(cl_command_queue CommandQueue, cl_mem Buffer,		clEnqueueWriteBufferFcnTy(cl_command_queue CommandQueue, cl_mem Buffer,
cl_bool BlockingWrite, size_t Offset, size_t Size,		cl_bool BlockingWrite, size_t Offset, size_t Size,
const void *Ptr, cl_uint NumEventsInWaitList,		const void *Ptr, cl_uint NumEventsInWaitList,
const cl_event *EventWaitList, cl_event *Event);		const cl_event *EventWaitList, cl_event *Event);
static clEnqueueWriteBufferFcnTy *clEnqueueWriteBufferFcnPtr;		static clEnqueueWriteBufferFcnTy *clEnqueueWriteBufferFcnPtr;

		typedef cl_program
		clCreateProgramWithLLVMIntelFcnTy(cl_context Context, cl_uint NumDevices,
		const cl_device_id *DeviceList,
		const char *Filename, cl_int *ErrcodeRet);
		static clCreateProgramWithLLVMIntelFcnTy *clCreateProgramWithLLVMIntelFcnPtr;

typedef cl_program clCreateProgramWithBinaryFcnTy(		typedef cl_program clCreateProgramWithBinaryFcnTy(
cl_context Context, cl_uint NumDevices, const cl_device_id *DeviceList,		cl_context Context, cl_uint NumDevices, const cl_device_id *DeviceList,
const size_t *Lengths, const unsigned char **Binaries, cl_int *BinaryStatus,		const size_t *Lengths, const unsigned char **Binaries, cl_int *BinaryStatus,
cl_int *ErrcodeRet);		cl_int *ErrcodeRet);
static clCreateProgramWithBinaryFcnTy *clCreateProgramWithBinaryFcnPtr;		static clCreateProgramWithBinaryFcnTy *clCreateProgramWithBinaryFcnPtr;

typedef cl_int clBuildProgramFcnTy(		typedef cl_int clBuildProgramFcnTy(
cl_program Program, cl_uint NumDevices, const cl_device_id *DeviceList,		cl_program Program, cl_uint NumDevices, const cl_device_id *DeviceList,
[...] static void *getAPIHandleCL(void *Handle, const char *FuncName) {
if ((Err = dlerror()) != 0) {		if ((Err = dlerror()) != 0) {
fprintf(stderr, "Load OpenCL Runtime API failed: %s. \n", Err);		fprintf(stderr, "Load OpenCL Runtime API failed: %s. \n", Err);
return 0;		return 0;
}		}
return FuncPtr;		return FuncPtr;
}		}

static int initialDeviceAPILibrariesCL() {		static int initialDeviceAPILibrariesCL() {
		HandleOpenCLBeignet = dlopen("/usr/local/lib/beignet/libcl.so", RTLD_LAZY);
HandleOpenCL = dlopen("libOpenCL.so", RTLD_LAZY);		HandleOpenCL = dlopen("libOpenCL.so", RTLD_LAZY);
if (!HandleOpenCL) {		if (!HandleOpenCL) {
fprintf(stderr, "Cannot open library: %s. \n", dlerror());		fprintf(stderr, "Cannot open library: %s. \n", dlerror());
return 0;		return 0;
}		}
return 1;		return 1;
}		}

[...]
* http://pubs.opengroup.org/onlinepubs/9699919799/functions/dlsym.html		* http://pubs.opengroup.org/onlinepubs/9699919799/functions/dlsym.html
*/		*/
#pragma GCC diagnostic push		#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wpedantic"		#pragma GCC diagnostic ignored "-Wpedantic"
static int initialDeviceAPIsCL() {		static int initialDeviceAPIsCL() {
if (initialDeviceAPILibrariesCL() == 0)		if (initialDeviceAPILibrariesCL() == 0)
return 0;		return 0;

		// FIXME: We are now always selecting the Intel Beignet driver if it is
		// available on the system, instead of a possible NVIDIA or AMD OpenCL
		// API. This selection should occur based on the target architecture
		// chosen when compiling.
		void *Handle =
		(HandleOpenCLBeignet != NULL ? HandleOpenCLBeignet : HandleOpenCL);

clGetPlatformIDsFcnPtr =		clGetPlatformIDsFcnPtr =
(clGetPlatformIDsFcnTy *)getAPIHandleCL(HandleOpenCL, "clGetPlatformIDs");		(clGetPlatformIDsFcnTy *)getAPIHandleCL(Handle, "clGetPlatformIDs");

clGetDeviceIDsFcnPtr =		clGetDeviceIDsFcnPtr =
(clGetDeviceIDsFcnTy *)getAPIHandleCL(HandleOpenCL, "clGetDeviceIDs");		(clGetDeviceIDsFcnTy *)getAPIHandleCL(Handle, "clGetDeviceIDs");

clGetDeviceInfoFcnPtr =		clGetDeviceInfoFcnPtr =
(clGetDeviceInfoFcnTy *)getAPIHandleCL(HandleOpenCL, "clGetDeviceInfo");		(clGetDeviceInfoFcnTy *)getAPIHandleCL(Handle, "clGetDeviceInfo");

clGetKernelInfoFcnPtr =		clGetKernelInfoFcnPtr =
(clGetKernelInfoFcnTy *)getAPIHandleCL(HandleOpenCL, "clGetKernelInfo");		(clGetKernelInfoFcnTy *)getAPIHandleCL(Handle, "clGetKernelInfo");

clCreateContextFcnPtr =		clCreateContextFcnPtr =
(clCreateContextFcnTy *)getAPIHandleCL(HandleOpenCL, "clCreateContext");		(clCreateContextFcnTy *)getAPIHandleCL(Handle, "clCreateContext");

clCreateCommandQueueFcnPtr = (clCreateCommandQueueFcnTy *)getAPIHandleCL(		clCreateCommandQueueFcnPtr = (clCreateCommandQueueFcnTy *)getAPIHandleCL(
HandleOpenCL, "clCreateCommandQueue");		Handle, "clCreateCommandQueue");

clCreateBufferFcnPtr =		clCreateBufferFcnPtr =
(clCreateBufferFcnTy *)getAPIHandleCL(HandleOpenCL, "clCreateBuffer");		(clCreateBufferFcnTy *)getAPIHandleCL(Handle, "clCreateBuffer");

clEnqueueWriteBufferFcnPtr = (clEnqueueWriteBufferFcnTy *)getAPIHandleCL(		clEnqueueWriteBufferFcnPtr = (clEnqueueWriteBufferFcnTy *)getAPIHandleCL(
HandleOpenCL, "clEnqueueWriteBuffer");		Handle, "clEnqueueWriteBuffer");

		if (HandleOpenCLBeignet)
		clCreateProgramWithLLVMIntelFcnPtr =
		(clCreateProgramWithLLVMIntelFcnTy *)getAPIHandleCL(
		Handle, "clCreateProgramWithLLVMIntel");

clCreateProgramWithBinaryFcnPtr =		clCreateProgramWithBinaryFcnPtr =
(clCreateProgramWithBinaryFcnTy *)getAPIHandleCL(		(clCreateProgramWithBinaryFcnTy *)getAPIHandleCL(
HandleOpenCL, "clCreateProgramWithBinary");		Handle, "clCreateProgramWithBinary");

clBuildProgramFcnPtr =		clBuildProgramFcnPtr =
(clBuildProgramFcnTy *)getAPIHandleCL(HandleOpenCL, "clBuildProgram");		(clBuildProgramFcnTy *)getAPIHandleCL(Handle, "clBuildProgram");

clCreateKernelFcnPtr =		clCreateKernelFcnPtr =
(clCreateKernelFcnTy *)getAPIHandleCL(HandleOpenCL, "clCreateKernel");		(clCreateKernelFcnTy *)getAPIHandleCL(Handle, "clCreateKernel");

clSetKernelArgFcnPtr =		clSetKernelArgFcnPtr =
(clSetKernelArgFcnTy *)getAPIHandleCL(HandleOpenCL, "clSetKernelArg");		(clSetKernelArgFcnTy *)getAPIHandleCL(Handle, "clSetKernelArg");

clEnqueueNDRangeKernelFcnPtr = (clEnqueueNDRangeKernelFcnTy *)getAPIHandleCL(		clEnqueueNDRangeKernelFcnPtr = (clEnqueueNDRangeKernelFcnTy *)getAPIHandleCL(
HandleOpenCL, "clEnqueueNDRangeKernel");		Handle, "clEnqueueNDRangeKernel");

clEnqueueReadBufferFcnPtr = (clEnqueueReadBufferFcnTy *)getAPIHandleCL(		clEnqueueReadBufferFcnPtr =
HandleOpenCL, "clEnqueueReadBuffer");		(clEnqueueReadBufferFcnTy *)getAPIHandleCL(Handle, "clEnqueueReadBuffer");

clFlushFcnPtr = (clFlushFcnTy *)getAPIHandleCL(HandleOpenCL, "clFlush");		clFlushFcnPtr = (clFlushFcnTy *)getAPIHandleCL(Handle, "clFlush");

clFinishFcnPtr = (clFinishFcnTy *)getAPIHandleCL(HandleOpenCL, "clFinish");		clFinishFcnPtr = (clFinishFcnTy *)getAPIHandleCL(Handle, "clFinish");

clReleaseKernelFcnPtr =		clReleaseKernelFcnPtr =
(clReleaseKernelFcnTy *)getAPIHandleCL(HandleOpenCL, "clReleaseKernel");		(clReleaseKernelFcnTy *)getAPIHandleCL(Handle, "clReleaseKernel");

clReleaseProgramFcnPtr =		clReleaseProgramFcnPtr =
(clReleaseProgramFcnTy *)getAPIHandleCL(HandleOpenCL, "clReleaseProgram");		(clReleaseProgramFcnTy *)getAPIHandleCL(Handle, "clReleaseProgram");

clReleaseMemObjectFcnPtr = (clReleaseMemObjectFcnTy *)getAPIHandleCL(		clReleaseMemObjectFcnPtr =
HandleOpenCL, "clReleaseMemObject");		(clReleaseMemObjectFcnTy *)getAPIHandleCL(Handle, "clReleaseMemObject");

clReleaseCommandQueueFcnPtr = (clReleaseCommandQueueFcnTy *)getAPIHandleCL(		clReleaseCommandQueueFcnPtr = (clReleaseCommandQueueFcnTy *)getAPIHandleCL(
HandleOpenCL, "clReleaseCommandQueue");		Handle, "clReleaseCommandQueue");

clReleaseContextFcnPtr =		clReleaseContextFcnPtr =
(clReleaseContextFcnTy *)getAPIHandleCL(HandleOpenCL, "clReleaseContext");		(clReleaseContextFcnTy *)getAPIHandleCL(Handle, "clReleaseContext");

return 1;		return 1;
}		}
#pragma GCC diagnostic pop		#pragma GCC diagnostic pop
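
The FIXME in `initialDeviceAPIsCL` notes that the Beignet handle is always preferred when it is present, even if the kernel was compiled for a non-Intel target. One possible direction, sketched below in C, is to pass the target architecture into the selection step. The `TargetArch` enum and `select_opencl_handle` are hypothetical names introduced for illustration; the patch itself does not pass the architecture to the runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the compiler-side GPUArch choices. */
typedef enum { TARGET_NVPTX64, TARGET_SPIR32, TARGET_SPIR64 } TargetArch;

/* Picks the library handle to resolve OpenCL symbols from.
 * The Beignet handle is used only for SPIR targets and only if it was
 * loaded; every other case falls back to the generic libOpenCL handle. */
static void *select_opencl_handle(TargetArch Arch, void *BeignetHandle,
                                  void *GenericHandle) {
  int WantsSPIR = (Arch == TARGET_SPIR32 || Arch == TARGET_SPIR64);
  if (WantsSPIR && BeignetHandle != NULL)
    return BeignetHandle;
  return GenericHandle;
}
```

With a helper of this shape, an NVPTX build would keep using `libOpenCL.so` even on a machine that also has Beignet installed.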

/* Context and Device. */		/* Context and Device. */
static PollyGPUContext *GlobalContext = NULL;		static PollyGPUContext *GlobalContext = NULL;
static cl_device_id GlobalDeviceID = NULL;		static cl_device_id GlobalDeviceID = NULL;
[...] static PollyGPUFunction *getKernelCL(const char *BinaryBuffer,
}		}

if (!GlobalDeviceID) {		if (!GlobalDeviceID) {
fprintf(stderr, "GPGPU-code generation not initialized correctly.\n");		fprintf(stderr, "GPGPU-code generation not initialized correctly.\n");
exit(-1);		exit(-1);
}		}

cl_int Ret;		cl_int Ret;

		if (HandleOpenCLBeignet) {
		// TODO: This is a workaround, since clCreateProgramWithLLVMIntel only
		// accepts a filename to a valid llvm-ir file as an argument, instead
		// of accepting the BinaryBuffer directly.
		FILE *fp = fopen("kernel.ll", "wb");
		if (fp != NULL) {
		fputs(BinaryBuffer, fp);
		fclose(fp);
		}

		((OpenCLKernel *)Function->Kernel)->Program =
		clCreateProgramWithLLVMIntelFcnPtr(
		((OpenCLContext *)GlobalContext->Context)->Context, 1,
		&GlobalDeviceID, "kernel.ll", &Ret);
		checkOpenCLError(Ret, "Failed to create program from llvm.\n");
		unlink("kernel.ll");
		} else {
size_t BinarySize = strlen(BinaryBuffer);		size_t BinarySize = strlen(BinaryBuffer);
((OpenCLKernel *)Function->Kernel)->Program = clCreateProgramWithBinaryFcnPtr(		((OpenCLKernel *)Function->Kernel)->Program =
((OpenCLContext *)GlobalContext->Context)->Context, 1, &GlobalDeviceID,		clCreateProgramWithBinaryFcnPtr(
(const size_t *)&BinarySize, (const unsigned char **)&BinaryBuffer, NULL,		((OpenCLContext *)GlobalContext->Context)->Context, 1,
&Ret);		&GlobalDeviceID, (const size_t *)&BinarySize,
		(const unsigned char **)&BinaryBuffer, NULL, &Ret);
checkOpenCLError(Ret, "Failed to create program from binary.\n");		checkOpenCLError(Ret, "Failed to create program from binary.\n");
		}

Ret = clBuildProgramFcnPtr(((OpenCLKernel *)Function->Kernel)->Program, 1,		Ret = clBuildProgramFcnPtr(((OpenCLKernel *)Function->Kernel)->Program, 1,
&GlobalDeviceID, NULL, NULL, NULL);		&GlobalDeviceID, NULL, NULL, NULL);
checkOpenCLError(Ret, "Failed to build program.\n");		checkOpenCLError(Ret, "Failed to build program.\n");

((OpenCLKernel *)Function->Kernel)->Kernel = clCreateKernelFcnPtr(		((OpenCLKernel *)Function->Kernel)->Kernel = clCreateKernelFcnPtr(
((OpenCLKernel *)Function->Kernel)->Program, KernelName, &Ret);		((OpenCLKernel *)Function->Kernel)->Program, KernelName, &Ret);
checkOpenCLError(Ret, "Failed to create kernel.\n");		checkOpenCLError(Ret, "Failed to create kernel.\n");
[...]
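
The Beignet path in `getKernelCL` writes the IR to a fixed file name, `kernel.ll`, in the current working directory, so two processes running at the same time could overwrite each other's file. A minimal C sketch of a variant that uses a uniquely named temporary file via POSIX `mkstemp` follows. The helper name `write_kernel_to_temp_file` and the `/tmp/polly-kernel-XXXXXX` template are assumptions for illustration, and whether `clCreateProgramWithLLVMIntel` accepts a file without the `.ll` suffix has not been checked here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Writes the NUL-terminated Buffer to a uniquely named temporary file.
 * On success the path is stored in PathOut and 0 is returned; the caller
 * is responsible for unlink()ing the file. Returns -1 on any failure. */
static int write_kernel_to_temp_file(const char *Buffer, char *PathOut,
                                     size_t PathSize) {
  static const char Template[] = "/tmp/polly-kernel-XXXXXX";
  assert(Buffer != NULL && PathOut != NULL);
  if (PathSize < sizeof(Template))
    return -1;
  memcpy(PathOut, Template, sizeof(Template));

  int Fd = mkstemp(PathOut); /* creates and opens a unique file */
  if (Fd < 0)
    return -1;

  size_t Len = strlen(Buffer);
  size_t Written = 0;
  while (Written < Len) {
    ssize_t Count = write(Fd, Buffer + Written, Len - Written);
    if (Count <= 0) { /* treat short or failed writes as an error */
      close(Fd);
      unlink(PathOut);
      return -1;
    }
    Written += (size_t)Count;
  }

  if (close(Fd) != 0) {
    unlink(PathOut);
    return -1;
  }
  return 0;
}
```

A caller would pass the returned path to the program-creation call and `unlink` it afterwards, as the patch already does for `kernel.ll`. Unlike the patch, this sketch reports a failed write instead of continuing silently.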