This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][RFC] libomp: Introduce hardware assisted barrier support for A64FX
Needs Review · Public

Authored by t-msn on Mar 29 2022, 5:08 AM.

Details

Summary

Hello,
This is an RFC version. There are some rough parts and some cases are not optimized,
but I'd like to hear whether the current approach is the right way to support a hardware-specific barrier.

Some details are described below:

  • Add a new barrier type, 'hard', which performs a hardware-assisted barrier. Currently this only works for the A64FX processor on Linux.
    • To use the hard barrier, all barrier patterns must be set to 'hard', i.e. KMP_FORKJOIN_BARRIER_PATTERN=hard,hard KMP_PLAIN_BARRIER_PATTERN=hard,hard KMP_REDUCTION_BARRIER_PATTERN=hard,hard
    • To use the hard barrier, the hardware barrier driver needs to be loaded on the system. The user interface to the driver is NOT stable at this point. The current driver is: https://github.com/t-msn/hwb_driver_oot/tree/version-20220329
      • The current driver uses sysfs to set up the hardware barrier, and the sysfs files are opened directly in the libomp code. I adopted this to avoid adding a library dependency to libomp, but now I feel it is better to offload the user-kernel interaction details to a runtime library and have libomp just dlopen that library (see the dlopen sketch below).
  • There are no restrictions on OpenMP syntax, but threads must be bound one per core, on consecutive cores. Basically this means "OMP_PLACES=threads OMP_PROC_BIND=close" is used, or affinity is set this way in the parallel clause.
  • Due to a hardware restriction, the hardware barrier only synchronizes within a group (L2-share domain). When a team's threads cross a group boundary, a hybrid barrier scheme is deployed: the barrier has a hierarchical structure, with a software barrier used for the inter-group barrier and the hardware barrier for the intra-group barrier (see the barrier sketch below).
  • Even when a system supports the hardware barrier, whether the hard barrier can be used is determined per team. If a team cannot use the hard barrier for some reason, a software barrier is used for that team.
  • If there are no tasks in the application code and only the intra-group barrier is used, KMP_TASKING=0 can be used to speed up the barrier operation.
  • As an implementation detail, thread handling is basically the same as for the distribution barrier, since both barriers require reconfiguration when the number of threads to be used changes.
  • I verified that almost all unit tests pass with the hard barrier used by default.

(This means: env LIT_OPTS="--show-unsupported --show-xfail -j 1" LIBOMP_TEST_ENV="KMP_PLAIN_BARRIER_PATTERN=hard,hard KMP_FORKJOIN_BARRIER_PATTERN=hard,hard KMP_REDUCTION_BARRIER_PATTERN=hard,hard OMP_PLACES=threads OMP_PROC_BIND=close" ninja check-openmp
with the patch https://reviews.llvm.org/D122645 applied. Due to a hardware resource limit, the tests cannot run in parallel.)
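
For illustration only, here is a minimal sketch of the dlopen-based approach mentioned above, where libomp would load a small runtime library instead of touching sysfs directly. The library name libhwb.so and the entry points hwb_init/hwb_sync are hypothetical placeholders, not names defined by the driver linked above.

  // Hypothetical: libomp loads an optional hardware-barrier helper library
  // at runtime so it has no hard link-time dependency on it.
  #include <dlfcn.h>

  typedef int (*hwb_init_fn)(void);
  typedef int (*hwb_sync_fn)(int group_id);

  static hwb_init_fn hwb_init = nullptr;
  static hwb_sync_fn hwb_sync = nullptr;

  // Returns true if the (hypothetical) helper library could be loaded and
  // initialized; otherwise the runtime falls back to a software barrier.
  static bool load_hwb_library() {
    void *handle = dlopen("libhwb.so", RTLD_NOW | RTLD_LOCAL);
    if (!handle)
      return false; // library not installed: use the software barrier
    hwb_init = reinterpret_cast<hwb_init_fn>(dlsym(handle, "hwb_init"));
    hwb_sync = reinterpret_cast<hwb_sync_fn>(dlsym(handle, "hwb_sync"));
    if (!hwb_init || !hwb_sync || hwb_init() != 0) {
      dlclose(handle);
      hwb_init = nullptr;
      hwb_sync = nullptr;
      return false;
    }
    return true;
  }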
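
Likewise, here is a conceptual sketch of the hybrid (hierarchical) barrier scheme, not the proposed libomp implementation: the hardware barrier handles the intra-group phases, and a small software barrier synchronizes one leader thread per group (L2-share domain). hwb_sync() is the same kind of hypothetical placeholder as above, and the group layout is simplified.

  // Conceptual sketch of the hybrid barrier; not the actual libomp code.
  #include <atomic>

  struct sw_barrier {
    std::atomic<int> arrived{0};         // group leaders that have arrived
    std::atomic<unsigned> generation{0}; // bumped when the last leader arrives
    int num_groups = 0;                  // number of L2-share domains in the team
  };

  // Placeholder for the intra-group hardware barrier primitive; the real one
  // would be provided by the driver/helper library.
  static void hwb_sync(int /*group_id*/) { /* hardware synchronization here */ }

  void hybrid_barrier(sw_barrier &bar, int group_id, bool is_group_leader) {
    // 1. Intra-group: every thread in the group meets on the hardware barrier.
    hwb_sync(group_id);

    if (is_group_leader) {
      // 2. Inter-group: one leader per group meets on a software barrier.
      unsigned gen = bar.generation.load(std::memory_order_acquire);
      if (bar.arrived.fetch_add(1, std::memory_order_acq_rel) + 1 == bar.num_groups) {
        bar.arrived.store(0, std::memory_order_relaxed);
        bar.generation.fetch_add(1, std::memory_order_release); // release the others
      } else {
        while (bar.generation.load(std::memory_order_acquire) == gen) {
          // spin until the last group leader arrives
        }
      }
    }

    // 3. Intra-group again: the leader's release propagates to its group.
    hwb_sync(group_id);
  }

The two hwb_sync() calls bracket the software phase, so non-leader threads cannot leave the barrier before every group has arrived.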

Diff Detail

Event Timeline

t-msn created this revision. Mar 29 2022, 5:08 AM
Herald added a project: Restricted Project. Mar 29 2022, 5:08 AM
t-msn edited the summary of this revision. Mar 29 2022, 5:24 AM
t-msn edited the summary of this revision. Mar 29 2022, 5:29 AM
t-msn added a comment. Mar 29 2022, 5:41 AM

For reference, these are the results of the EPCC OpenMP micro-benchmark (syncbench overhead [us]) on A64FX (Linux 5.17).
(OMP_PLACES=threads OMP_PROC_BIND=close KMP_TASKING=0; OMP_NUM_THREADS is 12 or 48.)

              12 threads          48 threads
              hyper     hard      hyper     hard
PARALLEL       3.65     1.75       6.27     3.44
FOR            2.03     0.28       4.85     1.94
PARALLEL FOR   3.81     1.81       6.36     3.50
BARRIER        1.96     0.23       4.86     1.90
SINGLE         1.93     0.77       4.75     2.30
CRITICAL       0.49     0.48       0.99     0.96
LOCK/UNLOCK    0.54     0.54       1.03     1.03
ATOMIC         0.69     0.68       2.36     2.37
REDUCTION      6.03     2.73      11.77     6.75

t-msn published this revision for review. Mar 29 2022, 5:48 AM
t-msn added a reviewer: AndreyChurbanov.
t-msn added a project: Restricted Project.
t-msn edited the summary of this revision. Mar 29 2022, 5:34 PM
t-msn edited the summary of this revision. Mar 29 2022, 5:37 PM

Sorry, I wrote the wrong location for the kernel driver. I have updated the summary with the correct one.

t-msn added a comment. Mar 30 2022, 1:18 AM

Obviously I didn't add the proper #ifdef guards, so compilation fails on non-arm64 architectures with the current version. I will fix that.