diff --git a/clang/docs/SYCLSupport.rst b/clang/docs/SYCLSupport.rst new file mode 100644 --- /dev/null +++ b/clang/docs/SYCLSupport.rst @@ -0,0 +1,240 @@ + +SYCL Compiler and Runtime architecture design +============================================= + +Introduction +------------ + +This document describes the architecture of the SYCL compiler and runtime +library. More details are provided in +`external document `_\ , +which are going to be added to clang documentation in the future. + +SYCL Compiler architecture +-------------------------- + +SYCL application compilation flow for ``spir``/``spir64`` target: + + +.. image:: images/Compiler-HLD.svg + :target: images/Compiler-HLD.svg + :alt: High level component diagram for SYCL Compiler + + +*Diagram 1. Application build flow.* + +SYCL compiler logically can be split into the host compiler and a number of +device compilers—one per each supported target. Clang driver orchestrates the +compilation process, it will invoke the device compiler once per each requested +target, then it will invoke the host compiler to compile the host part of a +SYCL source. In the simplest case, when compilation and linkage are done in one +compiler driver invocation, once compilation is finished, the device object +files (which are really LLVM IR files) are linked with the ``llvm-link`` tool. +The resulting LLVM IR module can additionally processed to get program format +suitable for execution by specific back-end. For instance, ``llvm-spirv`` tool +translates LLVM bitcode to SPIR-V module for OpenCL back-end or ``ptxas``/ +``fatbin`` tools prepare LLVM for execution via CUDA back-end. After that +``clang-offload-wrapper`` tool wraps device module in a host object file. Once +all the host object files and the wrapped object with device code are ready, the +driver invokes the usual platform linker and the final executable called "fat +binary" is produced. This is a host executable or library with embedded linked +images for each target specified at the command line. + +There are many variations of the compilation process depending on whether user +chose to do one or more of the following: + +* perform compilation separately from linkage +* compile the device module ahead-of-time for one or more targets +* perform device code splitting so that device code is distributed across + multiple modules rather than enclosed in a single one +* perform linkage of static device libraries + Sections below provide more details on some of those scenarios. + +SYCL sources can be also compiled as a regular C++ code, in this mode there is +no "device part" of the code — everything is executed on the host. + +Device compiler is further split into the following major components: + + +* **Front-end** - parses input source, "outlines" device part of the code, + applies additional restrictions on the device code (e.g. no exceptions or + virtual calls), generates LLVM IR for the device code only and an "integration + header" which provides information like kernel name, parameters order and data + type for the runtime library. +* **Middle-end** - transforms the initial LLVM IR to get consumed by the + back-end. Today middle-end transformations include subset of LLVM standard + passes: + + * Any LLVM IR transformation can be applied with only one limitation: + the back-end compiler should be able to handle the transformed LLVM IR. + NOTE: the performance impact of transformation passes depends on accurate + target information, so it makes sense to disable such transformation for + "virtual" targets like SPIR. + * Optionally: Address space inference pass + * Optionally: LLVM IR -> SPIR-V translator or LLVM -> PTX. + +* **Back-end** - produces native "device" code. It is shown as + "Target-specific LLVM compiler" box on Diagram 1. It is invoked either at + compile time (in ahead-of-time compilation scenario) or at runtime + (in just-in-time compilation scenario). + +*Design note: in current design we use SYCL device front-end compiler to produce +the integration header for two reasons. First, it must be possible to use any +host compiler to produce SYCL heterogeneous applications. Second, even if the +same Clang compiler is used for the host compilation, information provided in +the integration header is used (included) by the SYCL runtime implementation, so +the header must be available before the host compilation starts.* + +SYCL support in Clang front-end +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +SYCL support in Clang front-end can be split into the following components: + +* Device code outlining. This component is responsible for identifying and + outlining "device code" in the single source. +* SYCL kernel function object (functor or lambda) lowering. This component + creates a SPIR kernel function interface for SYCL kernels. +* Device code diagnostics. This component enforces language restrictions on + device code. +* Integration header generation. This component emits information required for + binding host and device parts of the SYCL code. + +Device code outlining +~~~~~~~~~~~~~~~~~~~~~ + +Here is a code example of a SYCL program that demonstrates compiler outlining +work: + +.. code-block:: C++ + + int foo(int x) { return ++x; } + int bar(int x) { throw std::exception{"CPU code only!"}; } + ... + using namespace sycl; + queue Q; + buffer a{range<1>{1024}}; + Q.submit([&](handler& cgh) { + auto A = a.get_access(cgh); + cgh.parallel_for(range<1>{1024}, [=](id<1> index) { + A[index] = index[0] * 2 + foo(42); + }); + } + ... + +In this example, the compiler needs to compile the lambda expression passed +to the ``sycl::handler::parallel_for`` method, as well as the function ``foo`` +called from the lambda expression for the device. + +The compiler must also ignore the ``bar`` function when we compile the +"device" part of the single source code, as it's unused inside the device +portion of the source code (the contents of the lambda expression passed to the +``sycl::handler::parallel_for`` and any function called from this lambda +expression). + +The current approach is to use the SYCL kernel attribute in the runtime to +mark code passed to ``sycl::handler::parallel_for`` as "kernel functions". The +runtime library can't mark foo as "device" code - this is a compiler job: to +traverse all symbols accessible from kernel functions and add them to the +"device part" of the code marking them with the new SYCL device attribute. + +Lowering of lambda function objects and named function objects +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +SYCL memory objects shared between host and device can be accessed either +through the use of raw pointers to unified memory (known as USM +https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:usm) +or special ``accessor`` classes. SYCL device compiler sets address space +attribute for raw pointers captured by SYCL kernels. ``accessor`` classes +require additional processing as the "device" implementation of this class +contains pointers to the device memory as a class member. +`OpenCL SPIR-V environment specification `_ +doesn't allow passing structures with pointer type members as kernel parameters. + +SYCL also has a special mechanism for passing kernel arguments from host to the +device. In OpenCL kernel arguments are set by calling ``clSetKernelArg`` +function for each kernel argument, meanwhile in SYCL all the kernel arguments +are fields of "SYCL kernel function" which can be defined as a lambda function +or a named function object and passed as an argument to SYCL function for +invoking kernels (such as ``parallel_for`` or ``single_task``\ ). For example, +in the previous code snippet above ``accessor`` ``A`` is one such captured +kernel argument. + +To facilitate the mapping of SYCL kernel data members to SPIR-like kernel +arguments we added the generation of an SPIR kernel function inside the +compiler. A SPIR-like kernel function contains the body of the SYCL kernel +function, receives OpenCL-like parameters and additionally does some +manipulation to initialize SYCL kernel data members with these parameters. +A PTX / CUDA backend uses the same lowering logic, but generated LLVM IR has +minor differences (e.g. the identification of PTX kernels is different). + +The pseudo code for SPIR kernel function generated by the compiler looks like +this: + +.. code-block:: C++ + + // Generated kernel function (expressed in OpenCL-like pseudo-code) + __kernel KernelName(global int* a) { + KernelType KernelFuncObj; // Actually kernel function object declaration + // doesn't have a name in AST. + // Let the kernel function object have one captured field - accessor A. + // We need to init it with global pointer from arguments: + KernelFuncObj.A.__init(a); + // Body of the SYCL kernel from SYCL headers: + { + KernelFuncObj(); + } + } + +The compiler generates such kernel function for an instantiation of a function +template with ``sycl_kernel`` attribute. SYCL kernel invocation methods can use +following helper function template to lower SYCL kernel function object to +SPIR-like kernel function: + +.. code-block:: C++ + + // SYCL accessor class template + namespace sycl { + template + class accessor { + #ifdef __SYCL_DEVICE_ONLY__ + private: + __global T* p; + // accessor initialization method + __init(__global int* a/*, ...*/) { p = a; [...] } + #endif // __SYCL_DEVICE_ONLY__ + }; + } + + // SYCL kernel is a function object + class MyObj { + accessor A; // accessor contains a pointer to the global address space. + public: + void operator()(); // Body of the kernel + }; + + // Helper function template for kernel detection by the compiler + template + __attribute__((sycl_kernel)) void sycl_kernel_function(KernelType KernelFuncObj) { + // ... + KernelFuncObj(); + } + + [...] + MyObj Obj{/*some init*/}; + // The compiler will generate kernel function for MyOjb similar to the example above + sycl_kernel_function(Obj); + +The compiler also emits the integration header with the information required for +marshalling SPIR-like kernel arguments by the runtime library. Kernel function +and integration header are generated by the compiler from ``Sema`` library using +AST nodes. + +Additional details of kernel parameter passing may be found in the document +`SYCL Kernel Parameter Handling and Array Support `_. + +TODO +---- + +* Consider unifying the design among GPGPU programming models like SYCL, HIP, + CUDA and OpenMP-offload. In particular, device code outlining is the area + where the difference among programming models might be minimal. diff --git a/clang/docs/images/Compiler-HLD.svg b/clang/docs/images/Compiler-HLD.svg new file mode 100644 --- /dev/null +++ b/clang/docs/images/Compiler-HLD.svg @@ -0,0 +1,16522 @@ + +image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Page-1 + + Task.136llvm-linkSheet.141Sheet.145Sheet.146Sheet.147Sheet.148Sheet.149Sheet.150Sheet.154Sheet.155Sheet.156Sheet.157Sheet.158Task.176Offload-wrapperSheet.181Sheet.185Sheet.186Sheet.187Sheet.188Sheet.189Sheet.190Sheet.194Sheet.195Sheet.196Sheet.197Sheet.198Offload-wrapper + + +Sheet.137llvm-link + + + + + + + + + + + + + + + + + + + + + + + + + + Data Object + SourceFile.cpp + + Sheet.2 + + + + + + + Sheet.3 + + + + + SourceFile.cpp + + + + + + Dynamic Connector + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Task + Compiler driver + + Sheet.5 + + + + + + + Sheet.6 + + Sheet.7 + + + + Sheet.8 + + + + + Sheet.9 + + + + + + + + + + + Sheet.10 + + Sheet.11 + + + + Sheet.12 + + + + Sheet.13 + + + + + + + Sheet.14 + + + + + + + + + + + Sheet.15 + + Sheet.16 + + + + Sheet.17 + + + Sheet.18 + + + Sheet.19 + + Sheet.20 + + + + Sheet.21 + + + + Sheet.22 + + + Sheet.23 + + + Sheet.24 + + + Sheet.25 + + + Sheet.26 + + + + + + + + + Compiler driver + + + + + + Dynamic Connector.51 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Task.28 + SYCL device front-end compiler + + Sheet.29 + + + + + + + Sheet.30 + + Sheet.31 + + + + Sheet.32 + + + + + Sheet.33 + + + + + + + + + + + Sheet.34 + + Sheet.35 + + + + Sheet.36 + + + + Sheet.37 + + + + + + + Sheet.38 + + + + + + + + + + + Sheet.39 + + Sheet.40 + + + + Sheet.41 + + + Sheet.42 + + + Sheet.43 + + Sheet.44 + + + + Sheet.45 + + + + Sheet.46 + + + Sheet.47 + + + Sheet.48 + + + Sheet.49 + + + Sheet.50 + + + + + + + + + SYCL device front-end compiler + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Task.52 + C++ host compiler + + Sheet.53 + + + + + + + Sheet.54 + + Sheet.55 + + + + Sheet.56 + + + + + Sheet.57 + + + + + + + + + + + Sheet.58 + + Sheet.59 + + + + Sheet.60 + + + + Sheet.61 + + + + + + + Sheet.62 + + + + + + + + + + + Sheet.63 + + Sheet.64 + + + + Sheet.65 + + + Sheet.66 + + + Sheet.67 + + Sheet.68 + + + + Sheet.69 + + + + Sheet.70 + + + Sheet.71 + + + Sheet.72 + + + Sheet.73 + + + Sheet.74 + + + + + + + + + C++ host compiler + + + + + + + + + + + + + + + + + + + + + + + + + + Data Object.76 + LLVM IR + + Sheet.77 + + + + + + + Sheet.78 + + + + + LLVM IR + + + + + + Dynamic Connector.127 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Task.104 + Target specific LLVM compiler + + Sheet.105 + + + + + + + Sheet.106 + + + Sheet.108 + + + + + Sheet.109 + + + + + + + + Sheet.113 + + + + + + + Sheet.114 + + + + + + + + + + + Sheet.115 + + Sheet.116 + + + + Sheet.117 + + + Sheet.118 + + + + Sheet.122 + + + Sheet.123 + + + Sheet.124 + + + Sheet.125 + + + Sheet.126 + + + + + + + + + Target specific LLVM compiler + + + + + + + + + + + + + + + + + + + + + + + + + + Data Object.128 + Target binary + + Sheet.129 + + + + + + + Sheet.130 + + + + + Target binary + + + + + Dynamic Connector.131 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Data Object.165 + Host object file + + Sheet.166 + + + + + + + Sheet.167 + + + + + Host object file + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Task.176 + Offload-wrapper + + Sheet.177 + + + + + + + + Sheet.181 + + + + + + + + Sheet.185 + + + + + + + Sheet.186 + + + + + + + + + + + Sheet.187 + + Sheet.188 + + + + Sheet.189 + + + Sheet.190 + + + + Sheet.194 + + + Sheet.195 + + + Sheet.196 + + + Sheet.197 + + + Sheet.198 + + + + + + + + + Offload-wrapper + + + + + + + + Dynamic Connector.205 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Dynamic Connector.206 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Data Object.228 + Integration header + + Sheet.229 + + + + + + + Sheet.230 + + + + + Integrationheader + + + + + + Dynamic Connector.231 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Dynamic Connector.232 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Data Object.76LLVM IRSheet.77Sheet.78LLVM IR + + +Data Object.76SPIRVSheet.77Sheet.78SPIRV + + + + Dynamic Connector.227 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Task.136llvm-spirvSheet.141Sheet.145Sheet.146Sheet.147Sheet.148Sheet.149Sheet.150Sheet.154Sheet.155Sheet.156Sheet.157Sheet.158Task.176Offload-wrapperSheet.181Sheet.185Sheet.186Sheet.187Sheet.188Sheet.189Sheet.190Sheet.194Sheet.195Sheet.196Sheet.197Sheet.198Offload-wrapper + + +Sheet.137llvm-spirv + + + + Dynamic Connector.131 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Dynamic Connector.131 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Dynamic Connector.205 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Dynamic Connector.127 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Dynamic Connector.131 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Dynamic Connector.131 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Dynamic Connector.131 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Data Object.165Host object fileSheet.166Sheet.167Wrapper object file + + + + + + + + + + + + + + + + + + + + Data Object.201 + Fat binary file + + Sheet.202 + + + + + Sheet.203 + + + Task.176Offload-wrapperSheet.181Sheet.185Sheet.186Sheet.187Sheet.188Sheet.189Sheet.190Sheet.194Sheet.195Sheet.196Sheet.197Sheet.198Sheet.177Linker + Dynamic Connector.131 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Dynamic Connector.227 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Data Object.165Host object fileSheet.166Sheet.167Fat binary file + + + + + + + + + + + + + + + + + + + + Data Object.201 + Fat binary file + + Sheet.202 + + + + + Sheet.203 + + + + Dynamic Connector.131 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Dynamic Connector.131 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + (another source compiled) + + + + + + + + + + + + + + + + + + + + Data Object.165 + Host object file + + Sheet.166 + + + + + + + Sheet.167 + + + + + Host object file + + + + + Dynamic Connector.131 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + (another source compiled) + Dynamic Connector.227 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +