Added SPIR Code Generation to the PPCG Code Generator. This can be invoked using
the polly-gpu-arch flag value 'spir32' or 'spir64' for 32 and 64 bit code respectively.
In addition to that, runtime support has been added to execute said SPIR code on Intel
GPU's, where the system is equipped with Intel's open source driver Beignet (development
version). This requires the cmake flag 'USE_INTEL_OCL' to be turned on, and the polly-gpu-runtime
flag value to be 'libopencl'.
The transformation of LLVM IR to SPIR is currently quite a hack, consisting in part of regex
string transformations.
Has been tested (working) with Polybench 3.2 on an Intel i7-5500U (integrated graphics chip).
Details
- Reviewers
bollu grosser Meinersbur singam-sanjay - Commits
- rG2f3073b5cb88: [Polly][GPGPU] Added SPIR Code Generation and Corresponding Runtime Support for…
rPLO308751: [Polly][GPGPU] Added SPIR Code Generation and Corresponding Runtime Support for…
rL308751: [Polly][GPGPU] Added SPIR Code Generation and Corresponding Runtime Support for…
Diff Detail
- Build Status
Buildable 8468 Build 8468: arc lint + arc unit
Event Timeline
Hi Philipp,
this looks indeed already surprisingly good. As you noted, there are still some hacks, but I have good hopes that we can work around most of them. I added some comments. Let me know what you think!
Best,
Tobias
lib/CodeGen/PPCGCodeGeneration.cpp | ||
---|---|---|
40 | After having removed the hack below, this should hopefully not be needed any more. | |
1281 | Why do you emit NVIDIA intrinsics for SPIR code? Can you not emit an opencl barrier? If you don't know what intrinsic corresponds to this, create a simple .cl file, compile it with clang to SPIR and see which intrinsic gets emitted. | |
1819 | Instead of spreading the SPIR code all over the place, can we create one function addSPIRMetadata, that takes 'Args' as argument, emits all the metadata needed and also emits the metadata you add a little further below? | |
1919 | Why do you emit NVIDIA intrinsics. As discussed above, this there no way to emit the OpenCL intrinsics instead? | |
2253 | Interesting. So the optimization break stuff. Would be great to understand what exactly we brake! | |
2265 | It seems all these regexp hacks are needed because we don't generate the right intrinsics. What is holding us back from generating them directly? | |
tools/GPURuntime/GPUJIT.c | ||
155 | Why do these declarations need to be optional? I think they should compile easily in any situation, no? | |
223 | I am surprised. Intel's OpenCL uses such a different path. In fact, I would assume that libOpenCL.so as a generic OpenCL library would take care of loading beignet. Is this not the case? | |
288 | Does this break or does this just return nullptr if the function is not available. If we can always try and just detect if it is not around, this would be great! | |
519 | Why do we read the file from "kernel.ll" and not use the file passed to us in BinaryBuffer? |
lib/CodeGen/PPCGCodeGeneration.cpp | ||
---|---|---|
40 | That is correct, yes. This should only be needed as long as we have to manually change the string occurrences for barriers and global/local IDs. | |
1281 | The generated intrinsics are the ones declared in the SPIR reference, so for example @_Z13get_global_idj(i32 0) . I cannot find any corresponding intrinsics in LLVM's TableGen/Intrinsics, did I overlook some? This is essentially the only real point where the hack is needed. If we can insert correct intrinsics, we can ditch the regex. Because fixing this is all the regex does. | |
1819 | Will look into that. | |
1919 | Dito above. | |
2253 | Looking into that. What is clear, is that the optimized version skips a few 'dummy' computations, because the optimized kernels expect a smaller workgroup size in OpenCL. Not sure if and how we could account for that. It seems very tricky to me atm, have played around with it quite a bit. | |
2265 | Dito above. | |
tools/GPURuntime/GPUJIT.c | ||
155 | Oh yeah, my bad. Only the LLVMIntel one has to be conditional. | |
223 | The funny thing is: it does, somewhat. If I only dlopen libOpenCL and then call for example clCreateProgramWithBinary, looking at that in gdb reveals that the beignet routines get loaded by libOpenCL. However as soon as a function is not present in the standard OpenCL API, eg. clCreateProgramWithLLVMIntel, it does not like that. | |
288 | I'm gonna look into that, but iirc it didn't work. Might be wrong, will try it out again! | |
519 | Well the problem is that clCreateProgramWithLLVMIntel wants to read from a file, and requests a filename/filepath as an argument there. BinaryBuffer however holds the complete binary. If you know a simpler solution, that would of course be great, because this is another hacky-part. |
CMakeLists.txt | ||
---|---|---|
178 ↗ | (On Diff #105946) | Forgot to mention, but would be great if we could get away without a configure test here. |
lib/CodeGen/PPCGCodeGeneration.cpp | ||
1281 | No, I think they may not exist. However, you can just create declarations for such functions using code as in "GPUNodeBuilder::createCallGetKernel(". | |
1819 | I just see this is not only about the llvm::Type, but also carries an integer which is 0 for local/shared memory and 1 otherwise. So moving this into a function does not seem so easy. If you don't find an easy way, just leave it as it is. | |
2253 | It's OK. Let's get the remainder of the patch updated, then we can look at this issue later on. | |
tools/GPURuntime/GPUJIT.c | ||
519 | I see. Maybe that's fine then. Just add a comment for now. I would prefer if we could use clCreateProgramWithBinary and just pass an LLVM bitcode file, but this likely only works if the LLVM versions of beignet is very modern. Let's leave it for now like this. It's good to have something working! |
- Inserting SPIR barriers with custom function.
CMakeLists.txt | ||
---|---|---|
178 ↗ | (On Diff #105946) | Yes I agree. I don't know of a method for detecting a Beignet installation with CMake yet, but if that exists, that would of course be great. |
tools/GPURuntime/GPUJIT.c | ||
519 | Totally agreed, that would be the better option. As you pointed out, it is currently not yet possible. |
Nice. Can you potentially also add a test case?
CMakeLists.txt | ||
---|---|---|
178 ↗ | (On Diff #105946) | I don't think we want to add this to cmake at all. At best, we just compile the beignet support in unconditionally, if this is possibile. |
lib/CodeGen/PPCGCodeGeneration.cpp | ||
1986 | If you store the group names in an array, you can just index this array with "i". | |
2153 | Maybe call this insertCUDAKernelIntrinsics and insertSPIRKernelIntrincics for consistency? | |
tools/GPURuntime/GPUJIT.c | ||
223 | OK. Can you then unconditionally try to load beignet/libcl.so into another variable, e.g. HandleOpenCLBeignet? I assume this handle should be nullptr if the dlopen fails, right? Can you then check if this handle is zero and use this to decide if you want to call clCreateProgramWithLLVMIntel? This should allow us to get rid of all the conditional compilation. |
tools/GPURuntime/GPUJIT.c | ||
---|---|---|
519 | Another change. Could you possibly remove the conditional compilation here and just check if clCreateProgramWithLLVMIntelFcnPtr != nullptr? |
I am not yet too familiar with how to implement such a case and what exactly to check for, but if that is desired, I can look into it.
This should now address all concerns, minus the file-workaround.
lib/CodeGen/PPCGCodeGeneration.cpp | ||
---|---|---|
2153 | The reason I kept this separated the way it is now, without specifying "CUDA" in the earlier version, is that SPIR should be the only example. When AMD support follows, or anything where there's a registered LLVM target with corresponding intrinsics (AMDGPU in that case), the original function could be used with just an additional switch case, changing the intrinsics IDs. Do you agree with that? |
lib/CodeGen/PPCGCodeGeneration.cpp | ||
---|---|---|
2153 | And with 'example' I mean 'exception', my bad. |
Great. Yes, a test case would be great. It is not very complicated. You already updated test cases in your last patch. I would suggest you copy one of the simple double nested loop kernels from the PTX test and just check that the right SPIR intrinsics are generated! That should be enough.
Otherwise, I think this is good to go. Very nice work!
lib/CodeGen/PPCGCodeGeneration.cpp | ||
---|---|---|
2153 | Sure. Still, the function names could be more similar. What about insertKernelCallsSPIR? |
@singam-sanjay can you have a final look and commit it if you are OK with this patch.
@grosser Thanks for adding me as a reviewer ! It's helped me acquaint myself with the SPIR V codegen and understand GPUJIT better.
@PhilippSchaad I've made suggestions to restructure code, a possibly unintentional debug print statement and added some questions about the SPIR. Do let me know if (and why) some of my suggestions are unnecessary.
General Question: I was following a discussion about integrating the SPIR-V backend as a part of LLVM, i.e. moving it from /tools to lib/Target. If that were to happen, would we be able to generate and feed SPIR IR in the same way as NVPTX ?
lib/CodeGen/PPCGCodeGeneration.cpp | ||
---|---|---|
1831 | @PhilippSchaad I'm not yet acquainted with the SPIR annotations. Could you please explain why it matters to let the SPIR backend know whether a kernel argument is a read only ? | |
1833 | Why are you setting these metadata to empty strings ? Is this a feature yet to be implemented ? | |
2179 | nitpick: Do we need this ? | |
2205 | nitpick: Do we need this ? | |
2255 | Would it be better to move the SPIR specific code into createKernelASM, as well ? You might not need the switch cases to handle buggy control flow like, case GPUArch::SPIR64: case GPUArch::SPIR32: llvm_unreachable("Cannot generate ASM for SPIR architecture"); | |
tools/GPURuntime/GPUJIT.c | ||
148 | This is new to me. Why are you providing runtime support for Intel to JIT SPIR code ? Is it to run with the integrated GPU on an Intel CPU? Guessing from https://github.com/intel/beignet. | |
249 | void *Handle = ( HandleOpenCLBeignet ? HandleOpenCLBeignet : HandleOpenCL ); maybe ? Let me know if this isn't recommended. | |
251 | Is it that we prefer to use libraries provided by Intel's SDK and inegrated GPU over other OpenCL drivers and devices ? I'm assuming that this leads the runtime to use the integrated GPU even when a more powerful Radeon card is present. Please correct me if I'm wrong. | |
510 | Is this file opened inside <llvm_build>/bin or $PWD ? If it's $PWD, we should to take precautions to not open a file that the user needs. You could instead open a file in /tmp or %TEMP% in Windows, like "/tmp/R@ηdηдm3.ll". Anyways, you could implement this later. | |
516 | Was this supposed to a message letting the user know that the runtime's not using the "beignet" OpenCL implementation ? | |
519 | @grosser What is your rationale to avoid conditional compilation ? |
Will address the points mentioned asap. As for your SPIR-V question: The goal, should that back-end be added, would be to add SPIR-V compilation as an additional GPU target option (In the future). This would allow kernel execution on any SPIR-V supporting target device. This SPIR solution is essentially a 'quickfix' to get it to work on Intel, as long as SPIR-V is not an option.
lib/CodeGen/PPCGCodeGeneration.cpp | ||
---|---|---|
1831 | The issue here is the Intel Beignet driver. Beignet expects a very specific set of kernel annotations when parsing the Kernel IR. The metadata is simply required to be there, but for the kernels we generate here, it does not matter what is actually contained in there. This is entirely Beignet dependent, maybe Intel will adapt that in the future. | |
1833 | Dito above. | |
2179 | We can change it to an if-clause if that is desired. The compiler just complains when we have a switch and do not check every possible value of Arch. An alternative would be a default clause handling that llvm_unreachable. | |
2205 | Dito above. | |
2255 | Will do, good point. Originally moved it out because we thought we had to do a lot more hacking on the resulting IR. | |
tools/GPURuntime/GPUJIT.c | ||
148 | Your guess is exactly correct. This also corresponds to some of the points mentioned above. | |
249 | Definitely the prettier solution. Will change. | |
251 | This is actually a good point. We don't want to prefer the Intel provided SDK, the choice of selecting Intel's Beignet should be dependent on what -polly-gpu-arch value has been selected. AMD is in the pipeline. | |
510 | I will add this later, but yes it is $PWD. The solution with using /tmp and %TEMP% seem reasonable. | |
516 | Oops, thanks for pointing this out, this was unintentionally left in and should never be reached! Used it for debugging. Removing it. |
lib/CodeGen/PPCGCodeGeneration.cpp | ||
---|---|---|
1831 | Ohk. Thanks for the info ! | |
1833 | Alright. I guess beignet isn't worried about the names of kernel arguments in "kernel_arg_name". | |
2179 | I was talking about the break; after llvm_unreachable. If llvm_unreachable would call abort, the break would be redundant. | |
2205 | Ditto above ;-). | |
2255 | Alright. | |
tools/GPURuntime/GPUJIT.c | ||
148 | Ohk. | |
249 | void *Handle = ( HandleOpenCLBeignet!=NULL ? HandleOpenCLBeignet : HandleOpenCL ); would be better. | |
251 | Alright. Any ideas on how you'd differentiate between the integrated and dedicated GPU at command line ? | |
510 | Alright. | |
516 | 👍 |
Removed left over debug print, moved SPIR creation into createASM, fixed minor issues addressed in comments.
tools/GPURuntime/GPUJIT.c | ||
---|---|---|
251 | The goal is to run on AMD GPUs with the AMDGPU backend, which would mean the differentiation would be done with -polly-gpu-arch=spir32/amdgcn32 for example. I am not yet sure how to transport that choice to here, but I will look into that as I am working on the AMD side of things. Thanks for pointing out this flaw. |
After having removed the hack below, this should hopefully not be needed any more.