Page MenuHomePhabricator

[OpenMP] libomp: minor changes to improve library performance
ClosedPublic

Authored by AndreyChurbanov on Feb 1 2021, 1:25 PM.

Details

Summary

Three minor changes in this patch:

  • added UNLIKELY hint to few rarely executed branches;
  • replaced couple of run time checks with debug assertions;
  • moved check of presence of ittnotify tool from inside the function call.

These three changes incrementally improve library performance on SpecOMP2012 376.kdtree test to 1.5% in total
(used Intel 19 compiler + libomp on 2x24-core Intel CLX system).

Diff Detail

Unit TestsFailed

TimeTest
150 msx64 debian > Clang.CodeGen::builtin-align-assumption.c
Script: -- : 'RUN: at line 3'; /mnt/disks/ssd0/agent/llvm-project/build/bin/clang -cc1 -internal-isystem /mnt/disks/ssd0/agent/llvm-project/build/lib/clang/13.0.0/include -nostdsysteminc -triple=x86_64-unknown-unknown /mnt/disks/ssd0/agent/llvm-project/clang/test/CodeGen/builtin-align-assumption.c -emit-llvm -O1 -o - | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/clang/test/CodeGen/builtin-align-assumption.c
140 msx64 windows > Clang.CodeGen::builtin-align-assumption.c
Script: -- : 'RUN: at line 3'; c:\ws\w16-1\llvm-project\premerge-checks\build\bin\clang.exe -cc1 -internal-isystem c:\ws\w16-1\llvm-project\premerge-checks\build\lib\clang\13.0.0\include -nostdsysteminc -triple=x86_64-unknown-unknown C:\ws\w16-1\llvm-project\premerge-checks\clang\test\CodeGen\builtin-align-assumption.c -emit-llvm -O1 -o - | c:\ws\w16-1\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w16-1\llvm-project\premerge-checks\clang\test\CodeGen\builtin-align-assumption.c

Event Timeline

AndreyChurbanov created this revision.Feb 1 2021, 1:25 PM
AndreyChurbanov requested review of this revision.Feb 1 2021, 1:25 PM

UNLIKELY and assert seem ok, the third change not so much.

openmp/runtime/src/kmp_tasking.cpp
1926

This seems very unfortunate TBH. Now we duplicate the function body 3 times for some minor speedup somewhere? That is a slippery slope. What about a header implementation? Building the runtime with LTO?

Removed "if (UNLIKELY(itt_sync_create_ptr))" from kmp_itt_taskwait_object(), just to clarify previous change its usage.

Is it necessary to manually inline __kmp_itt_taskwait_object three times? The rest of the changes look good to me.

Is it necessary to manually inline __kmp_itt_taskwait_object three times? The rest of the changes look good to me.

Not sure I got your comment right, will try to clarify a bit.
I've just submitted update that removes "if (UNLIKELY(itt_sync_create_ptr))" from kmp_itt_taskwait_object() function.
Previous change #3 of the patch was to copy the "if" from the function. Now, to make thing clearer, the change is to move the "if" from inside the function.
It is not the inlining, it is just move of single "if" outside of the function.

So the scheme of the change #3 is to replace "call + if + return + if" with just "if" on the hot path of 99+% of execution (all those don't use an ittnotify tool). This change gives performance gain in both cases of library build - with and without LTO.

I named change1 the changes with "UNLIKELY" hint (except the last three); change2 = change1 + debug assertions; change3 = change2 + move of "if (UNLIKELY(__itt_sync_create_ptr))" in the following data:
Median time in sec of 3 runs of SpecOMP 2012 376.kdtree benchmark on 2x Xeon Gold 6252 (48 cores, 48 threads used):
trunk - 401
change1 - 399
change2 - 397
change3 - 395
trunk-LTO - 385
change1-LTO - 381
change2-LTO - 373
change3-LTO - 371

Similar performance trend seen on other platforms I have for testing (tried various families of Intel Xeon).

So all three changes in this patch give extra performance for both builds - with and without LTO.

Eventually I think we should enable LTO build, as it gives extra performance with tiny binary size increase - less than 1%. Link time increases, but still takes seconds.
I used trunk clang with -DLIBOMP_CXXFLAGS=-flto -DLIBOMP_LDFLAGS="-flto -fuse-ld=lld" cmake build options, and would prefer somebody more familiar with cmake do actual LTO enabling in cmake file(s).

Granted, it's not inlining but partial inlining (the conditional from the 6 statement function).
I'm not happy about that either but I guess we can dot that. However, given that the
conditional is gone inside the function there is now an implicit calling invariant the user
has to know about. I don't use ITT but I feel this is not great. Two potential ways to
remedy this, neither is required to land the patch but they are simply suggestions:

  1. Keep the conditional in the function. It will introduce minimal cost in ITT runs but none otherwise.
  2. Create a wrapper in the header which encapsulates the __kmp_itt_taskwait_object call with the conditional exposed. This way we don't need to manually duplicate it in three locations.

ITT folks should accept the patch, no blocking objections from my side.


I agree, the CMake LTO stuff is unrelated and can be tackled independently.

From my perspective, compiler support for final mergeable tasks would have a more significant impact for applications like kdtree. There is a broad misconception among OpenMP programmers what if(0) means/does.

The kdtree app could use if(A) final(a) mergeable in all cases which have if(A). Unfortunately, no compiler provides sensible support for final mergeable tasks. Even just adding mergeable to the if does not avoid the task creation with current compilers. Avoiding task creation could cut down the runtime by more than 50%. Any recursive tasking application with serial cut-off could profit from sensible compiler support for mergeable.

From my perspective, compiler support for final mergeable tasks would have a more significant impact for applications like kdtree. There is a broad misconception among OpenMP programmers what if(0) means/does.

The kdtree app could use if(A) final(a) mergeable in all cases which have if(A). Unfortunately, no compiler provides sensible support for final mergeable tasks. Even just adding mergeable to the if does not avoid the task creation with current compilers. Avoiding task creation could cut down the runtime by more than 50%. Any recursive tasking application with serial cut-off could profit from sensible compiler support for mergeable.

That seems unrelated (to some degree), let's talk about that via email ;)

Tried to address comments:
Restored the check in __kmp_itt_taskwait_object() just in case it could be used somewhere else in future.
Wrapped repeated code fragments into macros to reduce code duplication.

jdoerfert accepted this revision.Feb 9 2021, 3:05 PM

LGTM

openmp/runtime/src/kmp_itt.h
103

Is there a reason to prefer macros over inline linkage functions?

This revision is now accepted and ready to land.Feb 9 2021, 3:05 PM
openmp/runtime/src/kmp_itt.h
103

Using functions would need to make UNLIKELY macro visible here. The macro is defined in ompt-internal.h which does not look elegant to be included here. Duplicating the macro definition is also not a good solution. Moving the definition to some "neutral" place needs efforts, might be implemented separately if needed.

Hopefully ITT will eventually be removed from the code as non-standard profiling technique. But this may take (many) years until OMPT gets all the features ITT has, and then profilers (like Intel VTune) can shift from ITT to OMPT as it is a standard technique.

This revision was landed with ongoing or failed builds.Feb 11 2021, 1:43 PM
This revision was automatically updated to reflect the committed changes.