This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] libomp cleanup: move fast allocation routines to kmp_tasking.cpp
AbandonedPublic

Authored by AndreyChurbanov on Feb 1 2021, 2:14 PM.

Details

Summary

Move internal memory cache routines from omp_alloc.cpp to kmp_tasking.cpp where they are mostly used.
This can give compiler more opportunities to optimize tasking code that uses allocations on a hot path.

Diff Detail

Event Timeline

AndreyChurbanov created this revision.Feb 1 2021, 2:14 PM
AndreyChurbanov requested review of this revision.Feb 1 2021, 2:14 PM

I don't know if this is the right direction. Placing code based on call profiles seems to break the idea of modularity. I mean, ___kmp_fast_allocate is now a "tasking" thing?
I didn't see a reply yet, what about LTO for the runtime?

AndreyChurbanov abandoned this revision.Feb 7 2021, 12:59 PM

I don't know if this is the right direction. Placing code based on call profiles seems to break the idea of modularity. I mean, ___kmp_fast_allocate is now a "tasking" thing?
I didn't see a reply yet, what about LTO for the runtime?

@jdoerfert, thanks for the hint. I've made some performance experiments on SpecOMP 2012 376.kdtree test (which was initial trigger of this patch), and the results showed the patch does give some performance on current library build, but hurts performance on lto build. Moreover, the performance gain disappear if I also apply diff from https://reviews.llvm.org/D95816 (named it patch2 in the following data).

Some digits (time in sec):
I used Intel 19 compiler + libomp on 2x Xeon Gold 6252 (48 cores, 48 threads used):
trunk - 401
trunk+patch - 395
trunk+patch2 - 395
trunk-lto - 385
trunk+patch-lto 387

Similar performance trend seen on other platforms I have for testing.

So given that library built with lto gives better performance, and this patch hurts it, I am abandoning it.

I don't know if this is the right direction. Placing code based on call profiles seems to break the idea of modularity. I mean, ___kmp_fast_allocate is now a "tasking" thing?
I didn't see a reply yet, what about LTO for the runtime?

@jdoerfert, thanks for the hint. I've made some performance experiments on SpecOMP 2012 376.kdtree test (which was initial trigger of this patch), and the results showed the patch does give some performance on current library build, but hurts performance on lto build. Moreover, the performance gain disappear if I also apply diff from https://reviews.llvm.org/D95816 (named it patch2 in the following data).

Some digits (time in sec):
I used Intel 19 compiler + libomp on 2x Xeon Gold 6252 (48 cores, 48 threads used):
trunk - 401
trunk+patch - 395
trunk+patch2 - 395
trunk-lto - 385
trunk+patch-lto 387

Similar performance trend seen on other platforms I have for testing.

So given that library built with lto gives better performance, and this patch hurts it, I am abandoning it.

Very interesting numbers!

Should we enable LTO for the runtime build by default (assuming the compiler + linker combo allow it)? I doubt the compile time hit is too bad, and the shown performance win would certainly be worth it.