diff --git a/openmp/docs/remarks/OMP100.rst b/openmp/docs/remarks/OMP100.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP100.rst
@@ -0,0 +1,52 @@
+Potentially unknown OpenMP target region caller `[OMP100]`
+==========================================================
+
+.. _omp100:
+.. _omp_no_external_caller_in_target_region:
+
+A function remark that indicates the function, when compiled for a GPU, is
+potentially called from outside the translation unit. Note that a remark is
+only issued if we tried to perform an optimization which would require us to
+know all callers on the GPU.
+
+To facilitate OpenMP semantics on GPUs we provide a runtime mechanism through
+which the code that makes up the body of a parallel region is shared with the
+threads in the team. Generally we use the address of the outlined parallel
+region to identify the code that needs to be executed. If we know all target
+regions that reach the parallel region we can avoid this function pointer
+passing scheme and often improve the register usage on the GPU. However, if a
+parallel region on the GPU is in a function with external linkage we may not
+know all callers statically. If there are outside callers within target
+regions, this remark is to be ignored. If there are no such callers, users can
+modify the linkage and thereby help optimization with a `static` or
+`__attribute__((internal_linkage))` function annotation. If changing the
+linkage is impossible, e.g., because there are outside callers on the host,
+one can split the function into an externally visible interface which is not
+compiled for the target and an internal implementation which is compiled for
+the target and should be called from within the target region.
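+
+A minimal sketch of this split is shown below; the names ``work`` and
+``work_interface`` are illustrative. The internal implementation is the only
+version called from within the target region, while the externally visible
+interface remains available to outside callers on the host:
+
+.. code-block:: c++
+
+  // Internal linkage: every GPU caller of the parallel region below is
+  // statically known, so no [OMP100] remark is needed for this function.
+  static void work(int *x) {
+  #pragma omp parallel
+    {
+  #pragma omp atomic
+      *x += 1;
+    }
+  }
+
+  // Externally visible interface for outside callers on the host.
+  void work_interface(int *x) { work(x); }
+
+  int main() {
+    int x = 0;
+  #pragma omp target map(tofrom : x)
+    work(&x);
+  }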
diff --git a/openmp/docs/remarks/OMP110.rst b/openmp/docs/remarks/OMP110.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP110.rst
@@ -0,0 +1,83 @@
+Moving globalized variable to the stack. [OMP110]
+=================================================
+
+.. _omp110:
+
+This optimization remark indicates that a globalized variable was moved back to
+thread-local stack memory on the device. This occurs when the optimization pass
+can determine that a globalized variable cannot possibly be shared between
+threads and globalization was unnecessary. Using stack memory is the best-case
+scenario for data globalization as the variable can now be stored in fast
+register files on the device. This optimization requires full visibility of each
+variable.
+
+Globalization typically occurs when a pointer to a thread-local variable escapes
+the current scope. The compiler needs to be pessimistic and assume that the
+pointer could be shared between multiple threads according to the OpenMP
+standard. This is expensive on target offloading devices that do not allow
+threads to share data by default. Instead, this data must be moved to memory
+that can be shared, such as shared or global memory. This optimization moves the
+data back from shared or global memory to thread-local stack memory if the data
+is not actually shared between the threads.
+
+Examples
+--------
+
+A trivial example of globalization can be seen below. The compiler sees that a
+pointer to the thread-local variable ``x`` escapes the current scope and must
+globalize it even though that is not actually necessary. Fortunately, this
+optimization can undo this by looking at the variable's usage.
+
+.. code-block:: c++
+
+  void use(int *x) { }
+
+  void foo() {
+    int x;
+    use(&x);
+  }
+
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 omp110.cpp -O1 -Rpass=openmp-opt
+  omp110.cpp:6:7: remark: Moving globalized variable to the stack. [OMP110]
+    int x;
+        ^
+
+A less trivial example can be seen using C++'s complex numbers. In this case the
+overloaded arithmetic operators cause pointers to the complex numbers to escape
+the current scope, but they can again be removed once the usage is visible.
+
+.. code-block:: c++
+
+  #include <complex>
+
+  using complex = std::complex<double>;
+
+  void zaxpy(complex *X, complex *Y, const complex D, int N) {
+  #pragma omp target teams distribute parallel for firstprivate(D)
+    for (int i = 0; i < N; ++i)
+      Y[i] = D * X[i] + Y[i];
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 omp110.cpp -O1 -Rpass=openmp-opt
+  In file included from omp110.cpp:1:
+  In file included from /usr/bin/clang/lib/clang/13.0.0/include/openmp_wrappers/complex:27:
+  /usr/include/c++/8/complex:328:20: remark: Moving globalized variable to the stack. [OMP110]
+        complex<_Tp> __r = __x;
+                     ^
+  /usr/include/c++/8/complex:388:20: remark: Moving globalized variable to the stack. [OMP110]
+        complex<_Tp> __r = __x;
+                     ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading optimization remark.
diff --git a/openmp/docs/remarks/OMP111.rst b/openmp/docs/remarks/OMP111.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP111.rst
@@ -0,0 +1,66 @@
+Replaced globalized variable with X bytes of shared memory. [OMP111]
+====================================================================
+
+.. _omp111:
+
+This optimization occurs when a globalized variable's data is shared between
+multiple threads, but requires a static amount of memory that can be determined
+at compile time. This is the case when only a single thread creates the memory
+and then shares it with every thread. The memory can then be pushed to a
+static buffer of shared memory on the device. This optimization allows users to
+declare shared memory on the device without using OpenMP's allocators.
+
+Globalization normally occurs when a pointer to a thread-local variable escapes
+the current scope. If a single thread is responsible for creating and sharing
+the data it can instead be mapped directly to shared memory on the target
+device. Checking if only a single thread can execute an instruction requires
+that the parent functions have internal linkage. Otherwise, an external caller
+could invalidate this analysis by having multiple threads call that function.
+The optimization pass can automatically make internal copies of each function,
+but it is still recommended to mark them as internal using keywords like
+``static`` whenever possible.
+
+Example
+-------
+
+This optimization should apply to any variable declared in an OpenMP target
+region that is then shared with every thread in a parallel region. This allows
+the user to declare shared memory without using custom allocators. A simple
+stencil calculation shows how this can be used.
+.. code-block:: c++
+
+  #define MC 32  // Tile sizes chosen so each tile holds 32 x 32 doubles,
+  #define NC 32  // i.e., the 8192 bytes reported in the remark below.
+  #define dX 0.5 // Grid spacing; the value here is only illustrative.
+
+  void stencil(int M, int N, double *X, double *Y) {
+  #pragma omp target teams distribute collapse(2) \
+      map(to : X [0:M * N]) map(tofrom : Y [0:M * N])
+    for (int i0 = 0; i0 < M; i0 += MC) {
+      for (int j0 = 0; j0 < N; j0 += NC) {
+        double sX[MC][NC];
+
+  #pragma omp parallel for collapse(2) default(firstprivate)
+        for (int i1 = 0; i1 < MC; ++i1)
+          for (int j1 = 0; j1 < NC; ++j1)
+            sX[i1][j1] = X[(i0 + i1) * N + (j0 + j1)];
+
+  #pragma omp parallel for collapse(2) default(firstprivate)
+        for (int i1 = 1; i1 < MC - 1; ++i1)
+          for (int j1 = 1; j1 < NC - 1; ++j1)
+            Y[(i0 + i1) * N + (j0 + j1)] = (sX[i1 + 1][j1] + sX[i1 - 1][j1] +
+                                            sX[i1][j1 + 1] + sX[i1][j1 - 1] +
+                                            -4.0 * sX[i1][j1]) / (dX * dX);
+      }
+    }
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass=openmp-opt -fopenmp-version=51 omp111.cpp
+  omp111.cpp:10:14: remark: Replaced globalized variable with 8192 bytes of shared memory. [OMP111]
+        double sX[MC][NC];
+               ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading optimization remark.
diff --git a/openmp/docs/remarks/OMP112.rst b/openmp/docs/remarks/OMP112.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP112.rst
@@ -0,0 +1,89 @@
+Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112]
+=====================================================================================================
+
+.. _omp112:
+
+This missed-optimization remark indicates that a globalized value was found on
+the target device that was neither replaced with stack memory by
+:ref:`OMP110 <omp110>` nor with shared memory by :ref:`OMP111 <omp111>`.
+Globalization that has not been removed will need to be handled by the runtime
+and will significantly hurt performance.
+
+The OpenMP standard expects that threads can always share their data with each
+other. However, this is not true by default when offloading to a target device
+such as a GPU. Threads on a GPU cannot share their data unless it is first
+placed in global or shared memory. In order to create standards-compliant code,
+the Clang compiler will globalize any variables that could potentially be
+shared between the threads. In the majority of cases, globalized variables can
+either be returned to a thread-local stack, or pushed to shared memory. However,
+in a few cases globalization is necessary and will cause a performance penalty.
+
+Examples
+--------
+
+This example shows legitimate data sharing on the device. It is a convoluted
+example, but is completely compliant with the OpenMP standard. If globalization
+were not performed, it would produce different results on different target
+devices.
+
+.. code-block:: c++
+
+  #include <omp.h>
+  #include <cstdio>
+
+  #pragma omp declare target
+  static int *p;
+  #pragma omp end declare target
+
+  void foo() {
+    int x = omp_get_thread_num();
+    if (omp_get_thread_num() == 1)
+      p = &x;
+
+  #pragma omp barrier
+
+    printf("Thread %d: %d\n", omp_get_thread_num(), *p);
+  }
+
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-missed=openmp-opt omp112.cpp
+  omp112.cpp:9:7: remark: Found thread data sharing on the GPU. Expect degraded performance
+  due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
+    int x = omp_get_thread_num();
+        ^
+
+A less convoluted example of globalization that cannot be removed occurs when
+calling functions that are not visible from the current translation unit.
+.. code-block:: c++
+
+  extern void use(int *x);
+
+  void foo() {
+    int x;
+    use(&x);
+  }
+
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-missed=openmp-opt omp112.cpp
+  omp112.cpp:4:7: remark: Found thread data sharing on the GPU. Expect degraded performance
+  due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
+    int x;
+        ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading missed remark.
diff --git a/openmp/docs/remarks/OMP113.rst b/openmp/docs/remarks/OMP113.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP113.rst
@@ -0,0 +1,83 @@
+Could not move globalized variable to the stack. Variable is potentially captured in call. Mark parameter as `__attribute__((noescape))` to override. [OMP113]
+==============================================================================================================================================================
+
+.. _omp113:
+
+This missed-optimization remark indicates that a globalized value could not be
+moved to the stack because it is potentially captured by a call to an unknown
+function. In order for a globalized variable to be moved to the stack, no
+copies of its pointer may be stored. Otherwise it is considered captured and
+could potentially be shared between the threads. This can be overridden using a
+parameter-level attribute.
+
+Globalization will occur when a pointer to a thread-local variable escapes
+the current scope. In most cases it can be determined that the variable cannot
+be shared if a copy of its pointer is never made. However, this remark indicates
+that a copy of the pointer either is present, or is possible because the pointer
+is passed to a function outside the current translation unit.
+
+Examples
+--------
+
+If a pointer to a thread-local variable is passed to a function not visible in
+the current translation unit, we need to assume a copy is made of it that can be
+shared between the threads. This prevents :ref:`OMP110 <omp110>` from
+triggering, which will result in a performance penalty when executing on the
+target device.
+
+.. code-block:: c++
+
+  extern void use(int *x);
+
+  void foo() {
+    int x;
+    use(&x);
+  }
+
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-missed=openmp-opt omp113.cpp
+  omp113.cpp:4:7: remark: Could not move globalized variable to the stack. Variable is
+  potentially captured in call. Mark parameter as `__attribute__((noescape))` to
+  override. [OMP113]
+    int x;
+        ^
+
+As the remark suggests, this behavior can be overridden using the ``noescape``
+attribute. This tells the compiler that no reference to the object the pointer
+points to, derived from the parameter value, will survive after the function
+returns. The user is responsible for verifying that this assertion is correct.
+
+.. code-block:: c++
+
+  extern void use(__attribute__((noescape)) int *x);
+
+  void foo() {
+    int x;
+    use(&x);
+  }
+
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp113.cpp
+  omp113.cpp:4:7: remark: Moving globalized variable to the stack. [OMP110]
+    int x;
+        ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading missed remark.
diff --git a/openmp/docs/remarks/OptimizationRemarks.rst b/openmp/docs/remarks/OptimizationRemarks.rst
--- a/openmp/docs/remarks/OptimizationRemarks.rst
+++ b/openmp/docs/remarks/OptimizationRemarks.rst
@@ -15,41 +15,47 @@
 `_
 
-.. _ompXXX:
-
-Some OpenMP remarks start with a "tag", like `[OMP100]`, which indicates that
-there is further information about them on this page. To directly jump to the
-respective entry, navigate to
-`https://openmp.llvm.org/docs/remarks/OptimizationRemarks.html#ompXXX <https://openmp.llvm.org/docs/remarks/OptimizationRemarks.html#ompXXX>`_ where `XXX` is
-the three digit code shown in the tag.
-
-
-----
-
-
-.. _omp100:
-.. _omp_no_external_caller_in_target_region:
-
-`[OMP100]` Potentially unknown OpenMP target region caller
-----------------------------------------------------------
-
-A function remark that indicates the function, when compiled for a GPU, is
-potentially called from outside the translation unit. Note that a remark is
-only issued if we tried to perform an optimization which would require us to
-know all callers on the GPU.
-
-To facilitate OpenMP semantics on GPUs we provide a runtime mechanism through
-which the code that makes up the body of a parallel region is shared with the
-threads in the team. Generally we use the address of the outlined parallel
-region to identify the code that needs to be executed. If we know all target
-regions that reach the parallel region we can avoid this function pointer
-passing scheme and often improve the register usage on the GPU. However, If a
-parallel region on the GPU is in a function with external linkage we may not
-know all callers statically. If there are outside callers within target
-regions, this remark is to be ignored. If there are no such callers, users can
-modify the linkage and thereby help optimization with a `static` or
-`__attribute__((internal))` function annotation. If changing the linkage is
-impossible, e.g., because there are outside callers on the host, one can split
-the function into an external visible interface which is not compiled for
-the target and an internal implementation which is compiled for the target
-and should be called from within the target region.
+OpenMP Remarks
+--------------
+
+.. toctree::
+   :hidden:
+   :maxdepth: 1
+
+   OMP100
+   OMP110
+   OMP111
+   OMP112
+   OMP113
+
+.. list-table::
+   :widths: 15 15 70
+   :header-rows: 1
+
+   * - Diagnostic Number
+     - Diagnostic Kind
+     - Diagnostic Description
+   * - :ref:`OMP100 <omp100>`
+     - Analysis
+     - Potentially unknown OpenMP target region caller.
+   * - :ref:`OMP110 <omp110>`
+     - Optimization
+     - Moving globalized variable to the stack.
+   * - :ref:`OMP111 <omp111>`
+     - Optimization
+     - Replaced globalized variable with X bytes of shared memory.
+   * - :ref:`OMP112 <omp112>`
+     - Missed
+     - Found thread data sharing on the GPU. Expect degraded performance due to data globalization.
+   * - :ref:`OMP113 <omp113>`
+     - Missed
+     - Could not move globalized variable to the stack. Variable is potentially captured in call. Mark parameter as `__attribute__((noescape))` to override.
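+
+All of these remarks are produced by the ``openmp-opt`` pass. As the entries
+above show, each remark kind can be requested with the corresponding Clang
+remark flag; for example (``foo.cpp`` stands in for any offloading source file):
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 \
+      -Rpass=openmp-opt -Rpass-missed=openmp-opt -Rpass-analysis=openmp-opt foo.cpp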