diff --git a/openmp/docs/remarks/OMP100.rst b/openmp/docs/remarks/OMP100.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP100.rst
@@ -0,0 +1,26 @@
+Potentially unknown OpenMP target region caller `[OMP100]`
+==========================================================
+
+.. _omp100:
+.. _omp_no_external_caller_in_target_region:
+
+A function remark that indicates the function, when compiled for a GPU, is
+potentially called from outside the translation unit. Note that a remark is
+only issued if we tried to perform an optimization which would require us to
+know all callers on the GPU.
+
+To facilitate OpenMP semantics on GPUs we provide a runtime mechanism through
+which the code that makes up the body of a parallel region is shared with the
+threads in the team. Generally we use the address of the outlined parallel
+region to identify the code that needs to be executed. If we know all target
+regions that reach the parallel region we can avoid this function pointer
+passing scheme and often improve the register usage on the GPU. However, if a
+parallel region on the GPU is in a function with external linkage we may not
+know all callers statically. If there are outside callers within target
+regions, this remark is to be ignored. If there are no such callers, users can
+modify the linkage and thereby help optimization with a `static` or
+`__attribute__((internal))` function annotation. If changing the linkage is
+impossible, e.g., because there are outside callers on the host, one can split
+the function into an externally visible interface which is not compiled for
+the target and an internal implementation which is compiled for the target
+and should be called from within the target region.
diff --git a/openmp/docs/remarks/OMP101.rst b/openmp/docs/remarks/OMP101.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP101.rst
@@ -0,0 +1,6 @@
+Parallel region is used in unknown / unexpected ways. Will not attempt to rewrite the state machine. [OMP101]
+=============================================================================================================
+
+.. _omp101:
+
+An analysis remark that indicates that a parallel region has unknown calls.
diff --git a/openmp/docs/remarks/OMP102.rst b/openmp/docs/remarks/OMP102.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP102.rst
@@ -0,0 +1,8 @@
+Parallel region is not called from a unique kernel. Will not attempt to rewrite the state machine. [OMP102]
+===========================================================================================================
+
+.. _omp102:
+
+This analysis remark indicates that a given parallel region is called by
+multiple kernels. This prevents the compiler from attributing it to a single
+kernel and rewriting the state machine.
diff --git a/openmp/docs/remarks/OMP110.rst b/openmp/docs/remarks/OMP110.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP110.rst
@@ -0,0 +1,83 @@
+Moving globalized variable to the stack. [OMP110]
+=================================================
+
+.. _omp110:
+
+This optimization remark indicates that a globalized variable was moved back to
+thread-local stack memory on the device. This occurs when the optimization pass
+can determine that a globalized variable cannot possibly be shared between
+threads and globalization was unnecessary. Using stack memory is the best-case
+scenario for data globalization as the variable can now be stored in fast
+register files on the device.
This optimization requires full visibility of each
+variable.
+
+Globalization typically occurs when a pointer to a thread-local variable
+escapes the current scope. The compiler needs to be pessimistic and assume that
+the pointer could be shared between multiple threads according to the OpenMP
+standard. This is expensive on target offloading devices that do not allow
+threads to share data by default. Instead, this data must be moved to memory
+that can be shared, such as shared or global memory. This optimization moves
+the data back from shared or global memory to thread-local stack memory if the
+data is not actually shared between the threads.
+
+Examples
+--------
+
+A trivial example of globalization can be seen below. The compiler sees that a
+pointer to the thread-local variable ``x`` escapes the current scope and must
+globalize it even though it is not actually necessary. Fortunately, this
+optimization can undo the globalization by looking at the variable's usage.
+
+.. code-block:: c++
+
+  void use(int *x) { }
+
+  void foo() {
+    int x;
+    use(&x);
+  }
+
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 omp110.cpp -O1 -Rpass=openmp-opt
+  omp110.cpp:6:7: remark: Moving globalized variable to the stack. [OMP110]
+    int x;
+        ^
+
+A less trivial example can be seen using C++'s complex numbers. In this case
+the overloaded arithmetic operators cause pointers to the complex numbers to
+escape the current scope, but they can again be removed once the usage is
+visible.
+
+.. code-block:: c++
+
+  #include <complex>
+
+  using complex = std::complex<double>;
+
+  void zaxpy(complex *X, complex *Y, const complex D, int N) {
+  #pragma omp target teams distribute parallel for firstprivate(D)
+    for (int i = 0; i < N; ++i)
+      Y[i] = D * X[i] + Y[i];
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 omp110.cpp -O1 -Rpass=openmp-opt
+  In file included from omp110.cpp:1:
+  In file included from /usr/bin/clang/lib/clang/13.0.0/include/openmp_wrappers/complex:27:
+  /usr/include/c++/8/complex:328:20: remark: Moving globalized variable to the stack. [OMP110]
+        complex<_Tp> __r = __x;
+                     ^
+  /usr/include/c++/8/complex:388:20: remark: Moving globalized variable to the stack. [OMP110]
+        complex<_Tp> __r = __x;
+                     ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading optimization remark.
diff --git a/openmp/docs/remarks/OMP111.rst b/openmp/docs/remarks/OMP111.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP111.rst
@@ -0,0 +1,66 @@
+Replaced globalized variable with X bytes of shared memory. [OMP111]
+====================================================================
+
+.. _omp111:
+
+This optimization occurs when a globalized variable's data is shared between
+multiple threads, but requires a static amount of memory that can be determined
+at compile time. This is the case when only a single thread creates the memory
+and then shares it between every thread. The memory can then be pushed to a
+static buffer of shared memory on the device. This optimization allows users to
+declare shared memory on the device without using OpenMP's allocators.
+
+Globalization normally occurs when a pointer to a thread-local variable escapes
+the current scope. If a single thread is responsible for creating and sharing
+the data it can instead be mapped directly to shared memory on the target
+device.
Checking if only a single thread can execute an instruction requires
+that the parent functions have internal linkage. Otherwise, an external caller
+could invalidate this analysis by having multiple threads call that function.
+The optimization pass can automatically make internal copies of each function,
+but it is still recommended to mark them as internal using keywords like
+``static`` whenever possible.
+
+Example
+-------
+
+This optimization should apply to any variable declared in an OpenMP target
+region that is then shared with every thread in a parallel region. This allows
+the user to declare shared memory without using custom allocators. A simple
+stencil calculation shows how this can be used.
+
+.. code-block:: c++
+
+  void stencil(int M, int N, double *X, double *Y) {
+  #pragma omp target teams distribute collapse(2) \
+    map(to : X [0:M * N]) map(tofrom : Y [0:M * N])
+    for (int i0 = 0; i0 < M; i0 += MC) {
+      for (int j0 = 0; j0 < N; j0 += NC) {
+        double sX[MC][NC];
+
+  #pragma omp parallel for collapse(2) default(firstprivate)
+        for (int i1 = 0; i1 < MC; ++i1)
+          for (int j1 = 0; j1 < NC; ++j1)
+            sX[i1][j1] = X[(i0 + i1) * N + (j0 + j1)];
+
+  #pragma omp parallel for collapse(2) default(firstprivate)
+        for (int i1 = 1; i1 < MC - 1; ++i1)
+          for (int j1 = 1; j1 < NC - 1; ++j1)
+            Y[(i0 + i1) * N + j0 * j1] = (sX[i1 + 1][j1] + sX[i1 - 1][j1] +
+                                          sX[i1][j1 + 1] + sX[i1][j1 - 1] +
+                                          -4.0 * sX[i1][j1]) / (dX * dX);
+      }
+    }
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass=openmp-opt -fopenmp-version=51 omp111.cpp
+  omp111.cpp:10:14: remark: Replaced globalized variable with 8192 bytes of shared memory. [OMP111]
+        double sX[MC][NC];
+               ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading optimization remark.
diff --git a/openmp/docs/remarks/OMP112.rst b/openmp/docs/remarks/OMP112.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP112.rst
@@ -0,0 +1,89 @@
+Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112]
+=====================================================================================================
+
+.. _omp112:
+
+This missed remark indicates that a globalized value was found on the target
+device that was neither replaced with stack memory by :ref:`OMP110 <omp110>`
+nor with shared memory by :ref:`OMP111 <omp111>`. Globalization that has not
+been removed will need to be handled by the runtime and will significantly hurt
+performance.
+
+The OpenMP standard expects that threads can always share their data between
+each other. However, this is not true by default when offloading to a target
+device such as a GPU. Threads on a GPU cannot share their data unless it is
+first placed in global or shared memory. In order to create standards-compliant
+code, the Clang compiler will globalize any variables that could potentially be
+shared between the threads. In the majority of cases, globalized variables can
+either be returned to a thread-local stack, or pushed to shared memory.
+However, in a few cases globalization is truly necessary and will cause a
+performance penalty.
+
+Examples
+--------
+
+This example shows legitimate data sharing on the device. It is a convoluted
+example, but is completely compliant with the OpenMP standard. If globalization
+were not added, this code would produce different results on different target
+devices.
+
+.. code-block:: c++
+
+  #include <omp.h>
+  #include <cstdio>
+
+  #pragma omp declare target
+  static int *p;
+  #pragma omp end declare target
+
+  void foo() {
+    int x = omp_get_thread_num();
+    if (omp_get_thread_num() == 1)
+      p = &x;
+
+    #pragma omp barrier
+
+    printf ("Thread %d: %d\n", omp_get_thread_num(), *p);
+  }
+
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-missed=openmp-opt omp112.cpp
+  omp112.cpp:9:7: remark: Found thread data sharing on the GPU. Expect degraded performance
+  due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
+    int x = omp_get_thread_num();
+        ^
+
+A less convoluted example of globalization that cannot be removed occurs when
+calling functions that aren't visible from the current translation unit.
+
+.. code-block:: c++
+
+  extern void use(int *x);
+
+  void foo() {
+    int x;
+    use(&x);
+  }
+
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-missed=openmp-opt omp112.cpp
+  omp112.cpp:4:7: remark: Found thread data sharing on the GPU. Expect degraded performance
+  due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
+    int x;
+        ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading missed remark.
diff --git a/openmp/docs/remarks/OMP113.rst b/openmp/docs/remarks/OMP113.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP113.rst
@@ -0,0 +1,81 @@
+Could not move globalized variable to the stack. Variable is potentially captured in call. Mark parameter as `__attribute__((noescape))` to override. [OMP113]
+==============================================================================================================================================================
+
+.. _omp113:
+
+This missed remark indicates that a globalized value could not be moved to the
+stack because it is potentially captured by a call to a function we cannot
+analyze. In order for a globalized variable to be moved to the stack, no copies
+of its pointer may be stored. Otherwise it is considered captured and could
+potentially be shared between the threads. This can be overridden using a
+parameter-level attribute as suggested in the remark text.
+
+Globalization will occur when a pointer to a thread-local variable escapes
+the current scope. In most cases it can be determined that the variable cannot
+be shared if a copy of its pointer is never made. However, this remark
+indicates that such a copy either is present, or cannot be ruled out because
+the pointer is used outside the current translation unit.
+
+Examples
+--------
+
+If a pointer to a thread-local variable is passed to a function not visible in
+the current translation unit we need to assume a copy is made of it that can be
+shared between the threads. This prevents :ref:`OMP110 <omp110>` from
+triggering, which will result in a performance penalty when executing on the
+target device.
+
+.. code-block:: c++
+
+  extern void use(int *x);
+
+  void foo() {
+    int x;
+    use(&x);
+  }
+
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-missed=openmp-opt omp113.cpp
+  omp113.cpp:4:7: remark: Could not move globalized variable to the stack. Variable is
+  potentially captured in call. Mark parameter as `__attribute__((noescape))` to
+  override. [OMP113]
+    int x;
+        ^
+
+As the remark suggests, this behaviour can be overridden using the ``noescape``
+attribute.
This tells the compiler
+that no reference to the object the pointer points to that is derived from the
+parameter value will survive after the function returns. The user is
+responsible for verifying that this assertion is correct.
+
+.. code-block:: c++
+
+  extern void use(__attribute__((noescape)) int *x);
+
+  void foo() {
+    int x;
+    use(&x);
+  }
+
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp113.cpp
+  omp113.cpp:4:7: remark: Moving globalized variable to the stack. [OMP110]
+    int x;
+        ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading missed remark.
diff --git a/openmp/docs/remarks/OMP120.rst b/openmp/docs/remarks/OMP120.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP120.rst
@@ -0,0 +1,93 @@
+Transformed generic-mode kernel to SPMD-mode [OMP120]
+=====================================================
+
+.. _omp120:
+
+This optimization remark indicates that the execution strategy for the OpenMP
+target offloading kernel was changed. Generic-mode kernels execute by using a
+single thread to schedule parallel worker threads using a state machine. This
+code transformation can move a kernel that was initially generated in generic
+mode to SPMD-mode where all threads are active at the same time with no state
+machine. This is only possible if the instructions previously executed by a
+single thread have no side-effects or can be guarded. If the instructions have
+no side-effects they are simply recomputed by each thread.
+
+Generic-mode is often considerably slower than SPMD-mode because of the extra
+overhead required to separately schedule worker threads and pass data between
+them. This optimization allows users to use generic-mode semantics while
+achieving the performance of SPMD-mode. This can be helpful when defining
+shared memory between the threads using :ref:`OMP111 <omp111>`.
+
+Examples
+--------
+
+Normally, any kernel that contains split OpenMP target and parallel regions
+will be executed in generic-mode. Sometimes it is easier to use generic-mode
+semantics to define shared memory, or to more tightly control the distribution
+of the threads. This shows a naive matrix-matrix multiplication that contains
+code that will need to be guarded.
+
+.. code-block:: c++
+
+  void matmul(int M, int N, int K, double *A, double *B, double *C) {
+  #pragma omp target teams distribute collapse(2) \
+    map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
+    for (int i = 0; i < M; i++) {
+      for (int j = 0; j < N; j++) {
+        double sum = 0.0;
+
+  #pragma omp parallel for reduction(+:sum) default(firstprivate)
+        for (int k = 0; k < K; k++)
+          sum += A[i*K + k] * B[k*N + j];
+
+        C[i*N + j] = sum;
+      }
+    }
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -fopenmp-version=51 -O2 -Rpass=openmp-opt omp120.cpp
+  omp120.cpp:6:14: remark: Replaced globalized variable with 8 bytes of shared memory. [OMP111]
+        double sum = 0.0;
+               ^
+  omp120.cpp:2:1: remark: Transformed generic-mode kernel to SPMD-mode. [OMP120]
+  #pragma omp target teams distribute collapse(2) \
+  ^
+
+This requires guarding the store to the shared variable ``sum`` and the store
+to the matrix ``C``. This can be thought of as generating the code below.
+
+.. code-block:: c++
+
+  void matmul(int M, int N, int K, double *A, double *B, double *C) {
+  #pragma omp target teams distribute collapse(2) \
+    map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
+    for (int i = 0; i < M; i++) {
+      for (int j = 0; j < N; j++) {
+        double sum;
+  #pragma omp parallel default(firstprivate) shared(sum)
+        {
+  #pragma omp barrier
+          if (omp_get_thread_num() == 0)
+            sum = 0.0;
+  #pragma omp barrier
+
+  #pragma omp for reduction(+:sum)
+          for (int k = 0; k < K; k++)
+            sum += A[i*K + k] * B[k*N + j];
+
+  #pragma omp barrier
+          if (omp_get_thread_num() == 0)
+            C[i*N + j] = sum;
+  #pragma omp barrier
+        }
+      }
+    }
+  }
+
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading optimization remark.
diff --git a/openmp/docs/remarks/OMP121.rst b/openmp/docs/remarks/OMP121.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP121.rst
@@ -0,0 +1,80 @@
+Value has potential side effects preventing SPMD-mode execution. Add `__attribute__((assume(\"ompx_spmd_amenable\")))` to the called function to override. [OMP121]
+===================================================================================================================================================================
+
+.. _omp121:
+
+This analysis remark indicates that a potential side effect that cannot be
+guarded prevents the target region from executing in SPMD-mode. SPMD-mode
+requires that each thread is active inside the region. Any instruction that
+cannot be either recomputed by each thread independently or guarded and
+executed by a single thread prevents the region from executing in SPMD-mode.
+
+This remark will attempt to print out the instructions preventing the region
+from being executed in SPMD-mode. Calls to functions outside the current
+translation unit will prevent this transformation from occurring as well, but
+this can be overridden using an assumption stating that the called function
+contains no code that prevents SPMD execution.
+
+Examples
+--------
+
+Calls to functions outside the current translation unit may contain
+instructions or operations that cannot be executed in SPMD-mode.
+
+.. code-block:: c++
+
+  extern int work();
+
+  void use(int x);
+
+  void foo() {
+  #pragma omp target teams
+    {
+      int x = work();
+  #pragma omp parallel
+      use(x);
+    }
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-analysis=openmp-opt omp121.cpp
+  omp121.cpp:8:13: remark: Value has potential side effects preventing SPMD-mode
+  execution. Add `__attribute__((assume("ompx_spmd_amenable")))` to the called function
+  to override. [OMP121]
+      int x = work();
+              ^
+
+As the remark suggests, the problem is caused by the unknown call to the
+external function ``work``. This can be overridden by asserting that it does
+not contain any code that prevents SPMD-mode execution.
+
+.. code-block:: c++
+
+  __attribute__((assume("ompx_spmd_amenable"))) extern int work();
+
+  void use(int x);
+
+  void foo() {
+  #pragma omp target teams
+    {
+      int x = work();
+  #pragma omp parallel
+      use(x);
+    }
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp121.cpp
+  omp121.cpp:6:1: remark: Transformed generic-mode kernel to SPMD-mode. [OMP120]
+  #pragma omp target teams
+  ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading analysis remark.
diff --git a/openmp/docs/remarks/OMP130.rst b/openmp/docs/remarks/OMP130.rst new file mode 100644 --- /dev/null +++ b/openmp/docs/remarks/OMP130.rst @@ -0,0 +1,36 @@ +Removing unused state machine from generic-mode kernel. [OMP130] +================================================================ + +.. _omp130: + +This optimization remark indicates that an unused state machine was removed from +a target region. This occurs when there are no parallel regions inside of a +target construct. Normally, a state machine is required to schedule the threads +inside of a parallel region. If there are no parallel regions, the state machine +is unnecessary because there is only a single thread active at any time. + +Examples +-------- + +This optimization should occur on any target region that does not contain any +parallel work. + +.. code-block:: c++ + + void copy(int N, double *X, double *Y) { + #pragma omp target teams distribute map(tofrom: X[0:N]) map(tofrom: Y[0:N]) + for (int i = 0; i < N; ++i) + Y[i] = X[i]; + } + +.. code-block:: console + + $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp130.cpp + omp130.cpp:2:1: remark: Removing unused state machine from generic-mode kernel. [OMP130] + #pragma omp target teams distribute map(tofrom: X[0:N]) map(tofrom: Y[0:N]) + ^ + +Diagnostic Scope +---------------- + +OpenMP target offloading optimization remark. diff --git a/openmp/docs/remarks/OMP131.rst b/openmp/docs/remarks/OMP131.rst new file mode 100644 --- /dev/null +++ b/openmp/docs/remarks/OMP131.rst @@ -0,0 +1,45 @@ +Rewriting generic-mode kernel with a customized state machine. [OMP131] +======================================================================= + +.. _omp131: + +This optimization remark indicates that a generic-mode kernel on the device was +specialized for the given target region. When offloading in generic-mode, a +state machine is required to schedule the work between the parallel worker +threads. This optimization specializes the state machine in cases where there is +a known number of parallel regions inside the kernel. + +Examples +-------- + +This optimization should occur on any generic-mode kernel that has visibility on +all parallel regions, but cannot be moved to SPMD-mode. + +.. code-block:: c++ + + #pragma omp declare target + int TID; + #pragma omp end declare target + + void foo() { + #pragma omp target + { + TID = omp_get_thread_num(); + #pragma omp parallel + { + work(); + } + } + } + +.. code-block:: console + + $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp131.cpp + omp131.cpp:8:1: remark: Rewriting generic-mode kernel with a customized state machine. [OMP131] + #pragma omp target + ^ + +Diagnostic Scope +---------------- + +OpenMP target offloading optimization remark. diff --git a/openmp/docs/remarks/OMP132.rst b/openmp/docs/remarks/OMP132.rst new file mode 100644 --- /dev/null +++ b/openmp/docs/remarks/OMP132.rst @@ -0,0 +1,45 @@ +Generic-mode kernel is executed with a customized state machine that requires a fallback. [OMP132] +================================================================================================== + +.. _omp132: + +This analysis remark indicates that a state machine rewrite occurred, but +could not be done fully because of unknown calls to functions that may contain +parallel regions. The state machine handles scheduling work between parallel +worker threads on the device when operating in generic-mode. 
If there are unknown parallel regions, the optimization cannot fully rewrite
+the state machine.
+
+Examples
+--------
+
+This will occur for any generic-mode kernel that may contain unknown parallel
+regions. This is typically coupled with the :ref:`OMP133 <omp133>` remark.
+
+.. code-block:: c++
+
+  extern void setup();
+
+  void foo() {
+  #pragma omp target
+    {
+      setup();
+  #pragma omp parallel
+      {
+        work();
+      }
+    }
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-analysis=openmp-opt omp132.cpp
+  omp132.cpp:4:1: remark: Generic-mode kernel is executed with a customized state machine
+  that requires a fallback. [OMP132]
+  #pragma omp target
+  ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading analysis remark.
diff --git a/openmp/docs/remarks/OMP133.rst b/openmp/docs/remarks/OMP133.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP133.rst
@@ -0,0 +1,70 @@
+Call may contain unknown parallel regions. Use `__attribute__((assume("omp_no_parallelism")))` to override. [OMP133]
+====================================================================================================================
+
+.. _omp133:
+
+This analysis remark identifies calls that prevented :ref:`OMP131 <omp131>`
+from providing the generic-mode kernel with a fully specialized state machine.
+This remark will identify each call that may contain unknown parallel regions
+that caused the kernel to require a fallback.
+
+Examples
+--------
+
+This will occur for any generic-mode kernel that may contain unknown parallel
+regions. This is typically coupled with the :ref:`OMP132 <omp132>` remark.
+
+.. code-block:: c++
+
+  extern void setup();
+
+  void foo() {
+  #pragma omp target
+    {
+      setup();
+  #pragma omp parallel
+      {
+        work();
+      }
+    }
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-analysis=openmp-opt omp133.cpp
+  omp133.cpp:6:5: remark: Call may contain unknown parallel regions. Use
+  `__attribute__((assume("omp_no_parallelism")))` to override. [OMP133]
+      setup();
+      ^
+
+The remark suggests marking the function with the assumption that it contains
+no parallel regions. If this is done then the kernel will be rewritten with a
+fully specialized state machine.
+
+.. code-block:: c++
+
+  __attribute__((assume("omp_no_parallelism"))) extern void setup();
+
+  void foo() {
+  #pragma omp target
+    {
+      setup();
+  #pragma omp parallel
+      {
+        work();
+      }
+    }
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp133.cpp
+  omp133.cpp:4:1: remark: Rewriting generic-mode kernel with a customized state machine. [OMP131]
+  #pragma omp target
+  ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading analysis remark.
diff --git a/openmp/docs/remarks/OMP140.rst b/openmp/docs/remarks/OMP140.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP140.rst
@@ -0,0 +1,49 @@
+Could not internalize function. Some optimizations may not be possible. [OMP140]
+================================================================================
+
+.. _omp140:
+
+This analysis remark indicates that function internalization failed for the
+given function. Internalization occurs when a call to a function that
+ordinarily has external visibility is replaced with a call to a copy of that
+function with only internal visibility. This allows the compiler to make
+strong static assertions about the context a function is called in.
Without internalization
+this analysis would always be invalidated by the possibility of someone
+calling the function in a different context outside of the current translation
+unit. This is necessary for optimizations like :ref:`OMP111 <omp111>` and
+:ref:`OMP120 <omp120>`. If a function failed to be internalized it most likely
+has linkage that cannot be copied. Internalization is currently only enabled by
+default for OpenMP target offloading.
+
+Examples
+--------
+
+This will occur for any function definition that has incompatible linkage.
+
+.. code-block:: c++
+
+  __attribute__((weak)) void setup() {}
+
+  void foo() {
+  #pragma omp target
+    {
+      setup();
+  #pragma omp parallel
+      {
+        work();
+      }
+    }
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-analysis=openmp-opt omp140.cpp
+  omp140.cpp:1:1: remark: Could not internalize function. Some optimizations may not
+  be possible. [OMP140]
+  __attribute__((weak)) void setup() {
+  ^
+
+Diagnostic Scope
+----------------
+
+OpenMP analysis remark.
diff --git a/openmp/docs/remarks/OMP150.rst b/openmp/docs/remarks/OMP150.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP150.rst
@@ -0,0 +1,42 @@
+Parallel region merged with parallel region at <location>. [OMP150]
+===================================================================
+
+.. _omp150:
+
+This optimization remark indicates that a parallel region was merged with
+others into a single parallel region. Parallel region merging fuses consecutive
+parallel regions to reduce the team activation overhead of forking and
+increases the scope of possible OpenMP-specific optimizations within merged
+parallel regions. This optimization can also guard sequential code between two
+parallel regions if applicable.
+
+Example
+-------
+
+This optimization should apply to any compatible and consecutive parallel
+regions. In this case the sequential region between the parallel regions will
+be guarded so it is only executed by a single thread in the new merged region.
+
+.. code-block:: c++
+
+  void foo() {
+  #pragma omp parallel
+    parallel_work();
+
+    sequential_work();
+
+  #pragma omp parallel
+    parallel_work();
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -O2 -Rpass=openmp-opt -mllvm -openmp-opt-enable-merging omp150.cpp
+  omp150.cpp:2:1: remark: Parallel region merged with parallel region at omp150.cpp:7:1. [OMP150]
+  #pragma omp parallel
+  ^
+
+Diagnostic Scope
+----------------
+
+OpenMP optimization remark.
diff --git a/openmp/docs/remarks/OMP160.rst b/openmp/docs/remarks/OMP160.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP160.rst
@@ -0,0 +1,44 @@
+Removing parallel region with no side-effects. [OMP160]
+=======================================================
+
+.. _omp160:
+
+This optimization remark indicates that a parallel region was deleted because
+it was not found to have any side-effects. This can occur if the region does
+not write any of its results to memory visible outside the region. This
+optimization is necessary because the barrier between sequential and parallel
+code typically prevents dead code elimination from completely removing the
+region. Without this optimization there would still be overhead to fork and
+join the threads.
+
+Example
+-------
+
+This optimization occurs whenever a parallel region was not found to have any
+side-effects. This can occur if the parallel region only reads memory or is
+simply empty.
+
+.. code-block:: c++
+
+  void foo() {
+  #pragma omp parallel
+    { }
+  #pragma omp parallel
+    { int x = 1; }
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -O2 -Rpass=openmp-opt omp160.cpp
+  omp160.cpp:4:1: remark: Removing parallel region with no side-effects. [OMP160] [-Rpass=openmp-opt]
+  #pragma omp parallel
+  ^
+  omp160.cpp:2:1: remark: Removing parallel region with no side-effects. [OMP160] [-Rpass=openmp-opt]
+  #pragma omp parallel
+  ^
+
+Diagnostic Scope
+----------------
+
+OpenMP optimization remark.
diff --git a/openmp/docs/remarks/OMP170.rst b/openmp/docs/remarks/OMP170.rst
new file mode 100644
--- /dev/null
+++ b/openmp/docs/remarks/OMP170.rst
@@ -0,0 +1,41 @@
+OpenMP runtime call deduplicated. [OMP170]
+==========================================
+
+.. _omp170:
+
+This optimization remark indicates that a call to an OpenMP runtime call was
+replaced with the result of an existing one. This occurs when the compiler
+knows that the result of a runtime call is immutable. Removing duplicate calls
+is done by replacing all calls to that function with the result of the first
+call. This cannot be done by generic optimizations because the implementations
+of the OpenMP runtime calls live in a separate library the compiler cannot see,
+so the OpenMP-aware optimization pass relies on its knowledge of the runtime to
+remove the duplicates.
+
+Example
+-------
+
+This optimization will trigger for known OpenMP runtime calls whose return
+value will not change.
+
+.. code-block:: c++
+
+  void foo(int N) {
+    double *A = malloc(N * omp_get_thread_limit());
+    double *B = malloc(N * omp_get_thread_limit());
+
+  #pragma omp parallel
+    work(&A[omp_get_thread_num() * N]);
+  #pragma omp parallel
+    work(&B[omp_get_thread_num() * N]);
+  }
+
+.. code-block:: console
+
+  $ clang -fopenmp -O2 -Rpass=openmp-opt omp170.c
+  omp170.c:2:26: remark: OpenMP runtime call omp_get_thread_limit deduplicated. [OMP170]
+    double *A = malloc(N * omp_get_thread_limit());
+                           ^
+
+Diagnostic Scope
+----------------
+
+OpenMP optimization remark.
diff --git a/openmp/docs/remarks/OptimizationRemarks.rst b/openmp/docs/remarks/OptimizationRemarks.rst
--- a/openmp/docs/remarks/OptimizationRemarks.rst
+++ b/openmp/docs/remarks/OptimizationRemarks.rst
@@ -15,41 +15,95 @@
 `_
-.. _ompXXX:
-
-Some OpenMP remarks start with a "tag", like `[OMP100]`, which indicates that
-there is further information about them on this page. To directly jump to the
-respective entry, navigate to
-`https://openmp.llvm.org/docs/remarks/OptimizationRemarks.html#ompXXX `_ where `XXX` is
-the three digit code shown in the tag.
-
-
-----
-
-
-.. _omp100:
-.. _omp_no_external_caller_in_target_region:
-
-`[OMP100]` Potentially unknown OpenMP target region caller
------------------------------------------------------------
-
-A function remark that indicates the function, when compiled for a GPU, is
-potentially called from outside the translation unit. Note that a remark is
-only issued if we tried to perform an optimization which would require us to
-know all callers on the GPU.
-
-To facilitate OpenMP semantics on GPUs we provide a runtime mechanism through
-which the code that makes up the body of a parallel region is shared with the
-threads in the team. Generally we use the address of the outlined parallel
-region to identify the code that needs to be executed. If we know all target
-regions that reach the parallel region we can avoid this function pointer
-passing scheme and often improve the register usage on the GPU. However, If a
-parallel region on the GPU is in a function with external linkage we may not
-know all callers statically. If there are outside callers within target
-regions, this remark is to be ignored. If there are no such callers, users can
-modify the linkage and thereby help optimization with a `static` or
-`__attribute__((internal))` function annotation. If changing the linkage is
-impossible, e.g., because there are outside callers on the host, one can split
-the function into an external visible interface which is not compiled for
-the target and an internal implementation which is compiled for the target
-and should be called from within the target region.
+OpenMP Remarks
+--------------
+
+.. toctree::
+   :hidden:
+   :maxdepth: 1
+
+   OMP100
+   OMP101
+   OMP102
+   OMP110
+   OMP111
+   OMP112
+   OMP113
+   OMP120
+   OMP121
+   OMP130
+   OMP131
+   OMP132
+   OMP133
+   OMP140
+   OMP150
+   OMP160
+   OMP170
+
+.. list-table::
+   :widths: 15 15 70
+   :header-rows: 1
+
+   * - Diagnostics Number
+     - Diagnostics Kind
+     - Diagnostics Description
+   * - :ref:`OMP100 <omp100>`
+     - Analysis
+     - Potentially unknown OpenMP target region caller.
+   * - :ref:`OMP101 <omp101>`
+     - Analysis
+     - Parallel region is used in unknown / unexpected ways. Will not attempt to
+       rewrite the state machine.
+   * - :ref:`OMP102 <omp102>`
+     - Analysis
+     - Parallel region is not called from a unique kernel. Will not attempt to
+       rewrite the state machine.
+   * - :ref:`OMP110 <omp110>`
+     - Optimization
+     - Moving globalized variable to the stack.
+   * - :ref:`OMP111 <omp111>`
+     - Optimization
+     - Replaced globalized variable with X bytes of shared memory.
+   * - :ref:`OMP112 <omp112>`
+     - Missed
+     - Found thread data sharing on the GPU. Expect degraded performance due to
+       data globalization.
+   * - :ref:`OMP113 <omp113>`
+     - Missed
+     - Could not move globalized variable to the stack. Variable is potentially
+       captured in call. Mark parameter as `__attribute__((noescape))` to
+       override.
+   * - :ref:`OMP120 <omp120>`
+     - Optimization
+     - Transformed generic-mode kernel to SPMD-mode.
+   * - :ref:`OMP121 <omp121>`
+     - Analysis
+     - Value has potential side effects preventing SPMD-mode execution. Add
+       `__attribute__((assume(\"ompx_spmd_amenable\")))` to the called function
+       to override.
+   * - :ref:`OMP130 <omp130>`
+     - Optimization
+     - Removing unused state machine from generic-mode kernel.
+   * - :ref:`OMP131 <omp131>`
+     - Optimization
+     - Rewriting generic-mode kernel with a customized state machine.
+   * - :ref:`OMP132 <omp132>`
+     - Analysis
+     - Generic-mode kernel is executed with a customized state machine that
+       requires a fallback.
+   * - :ref:`OMP133 <omp133>`
+     - Analysis
+     - Call may contain unknown parallel regions. Use
+       `__attribute__((assume("omp_no_parallelism")))` to override.
+   * - :ref:`OMP140 <omp140>`
+     - Analysis
+     - Could not internalize function. Some optimizations may not be possible.
+   * - :ref:`OMP150 <omp150>`
+     - Optimization
+     - Parallel region merged with parallel region at `<location>`.
+   * - :ref:`OMP160 <omp160>`
+     - Optimization
+     - Removing parallel region with no side-effects.
+   * - :ref:`OMP170 <omp170>`
+     - Optimization
+     - OpenMP runtime call deduplicated.