diff --git a/openmp/docs/optimizations/OpenMPOpt.rst b/openmp/docs/optimizations/OpenMPOpt.rst
--- a/openmp/docs/optimizations/OpenMPOpt.rst
+++ b/openmp/docs/optimizations/OpenMPOpt.rst
@@ -1,13 +1,109 @@
+==========================
 OpenMP-Aware Optimizations
 ==========================

+LLVM, since `version 11 `_ (12
+Oct 2020), supports an :ref:`OpenMP-Aware optimization pass `. This
+optimization pass attempts to optimize the module using OpenMP-specific
+domain knowledge. The pass is enabled by default at high optimization levels
+(O2 / O3) when compiling with OpenMP support enabled.
+
+.. _OpenMPOpt:
+
 OpenMPOpt
----------
+=========
+
+.. contents::
+   :local:
+   :depth: 1
+
+OpenMPOpt contains several OpenMP-aware optimizations. This pass is run early on
+the entire Module, and later on the entire call graph. Most optimizations done
+by OpenMPOpt support remarks. Optimization remarks can be enabled by compiling
+with the following flags:
+
+.. code-block:: console
+
+  $ clang -Rpass=openmp-opt -Rpass-missed=openmp-opt -Rpass-analysis=openmp-opt
+
+OpenMP Runtime Call Deduplication
+---------------------------------
+
+The OpenMP runtime library contains several functions used to implement features
+of the OpenMP standard. Several of these runtime calls return a constant value
+within a parallel region. A common optimization is to replace invariant code
+with a single reference, but in this case the compiler only sees an opaque call
+into the runtime library. To get around this, OpenMPOpt maintains a list of
+OpenMP runtime functions that are constant and deduplicates calls to them.
+
+Globalization
+-------------
+
+The OpenMP standard requires that data can be shared between different threads.
+This requirement poses a unique challenge when offloading to GPU accelerators.
+By default, data cannot be shared between the threads on a GPU; to be shared,
+it must be placed in either global or shared memory.
+This must be done whenever a variable may be shared in order to generate
+correct OpenMP programs. Unfortunately, it carries a significant performance
+cost and is unnecessary in the majority of cases. For example, when Clang
+generates code for the offloading region below, it sees that the variable
+``x`` escapes and is potentially shared. The variable must then be globalized,
+which means it cannot reside in registers on the device.
+
+.. code-block:: c++
+
+  void use(void *) { }
+
+  void foo() {
+    int x;
+    use(&x);  // The address of x escapes, so x is potentially shared.
+  }
+
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+In many cases, this transformation is not actually necessary but still carries a
+significant performance penalty. Because of this, OpenMPOpt can perform an
+inter-procedural optimization that scans each known use of the globalized
+variable to determine whether it is potentially captured and shared by another
+thread. If it is not actually captured, it can safely be moved back to fast
+register memory.
+
+Another case is memory that is intentionally shared between the threads, but is
+shared from one thread to all the others. Such variables can be moved to shared
+memory at compile time without needing to go through the runtime library. This
+allows users to confidently declare shared memory on the device without
+needing to use custom OpenMP allocators or rely on the runtime.
+
+
+.. code-block:: c++
+
+  static void share(void *) { }  // Placeholder body so the example links.
+
+  static void foo() {
+    int x[64];
+  #pragma omp parallel
+    share(x);  // One thread's x is shared with all threads in the region.
+  }
+
+  int main() {
+  #pragma omp target
+    foo();
+  }

+These optimizations can have a very large performance impact. Both of them
+rely heavily on inter-procedural analysis. Because of this, offloading
+applications should ideally be contained in a single translation unit, and
+functions should not be externally visible unless needed. If remarks are
+enabled, OpenMPOpt informs the user of any globalization calls that remain.
+This should be treated as a defect in the program.

 Resources
----------
+=========

+- 2021 OpenMP Webinar: "A Compiler's View of OpenMP" https://youtu.be/eIMpgez61r4
 - 2020 LLVM Developers’ Meeting: "(OpenMP) Parallelism-Aware Optimizations" https://youtu.be/gtxWkeLCxmU
 - 2019 EuroLLVM Developers’ Meeting: "Compiler Optimizations for (OpenMP) Target Offloading to GPUs" https://youtu.be/3AbS82C3X30