Hello,
This is an RFC version. There are some rough parts and some cases are not optimized,
but I'd like to hear whether the current approach is the right way to support a hardware-specific barrier.
Some notes are below:
- Add a new barrier type 'hard' which performs a hardware-assisted barrier. Currently this only works for the A64FX processor on Linux.
- To use the hard barrier, all barrier patterns must be set to 'hard', i.e. KMP_FORKJOIN_BARRIER_PATTERN=hard,hard KMP_PLAIN_BARRIER_PATTERN=hard,hard KMP_REDUCTION_BARRIER_PATTERN=hard,hard
- To use the hard barrier, the hardware barrier driver needs to be loaded in the system. The user interface to the driver is NOT stable at this point. The current driver is: https://github.com/t-msn/hwb_driver_oot/tree/version-20220329
- The current driver uses sysfs to set up the hardware barrier, and the libomp code opens the sysfs files directly. I adopted this to avoid adding a library dependency to libomp, but now I feel it would be better to offload the user-kernel interaction details to a runtime library that libomp just dlopens.
- There are no restrictions on OpenMP syntax, but the threads of a team are required to run on consecutive cores, one thread per core. Basically this means "OMP_PLACES=threads OMP_PROC_BIND=close" is used, or the affinity is set in this way in the parallel clause
- Due to a hardware restriction, the hardware barrier only synchronizes within a group (L2-sharing domain). When a team's threads cross a group boundary, a hybrid barrier scheme is deployed. That is, the barrier has a hierarchical structure: a software barrier is used for the inter-group barrier and the hardware barrier for the intra-group barrier
- Even when a system supports the hardware barrier, whether the hard barrier can be used is determined per team. If a team cannot use the hard barrier for some reason, the software barrier is used for that team
- If there are no tasks in the application code and only the intra-group barrier is used, KMP_TASKING=0 can be used to speed up the barrier operation
- As an implementation detail, thread handling is basically the same as for the distribution barrier, since both barriers require reconfiguration when the number of threads in use changes
- I tested that almost all unit tests pass with the hard barrier enabled by default
(This means: env LIT_OPTS="--show-unsupported --show-xfail -j 1" LIBOMP_TEST_ENV="KMP_PLAIN_BARRIER_PATTERN=hard,hard KMP_FORKJOIN_BARRIER_PATTERN=hard,hard KMP_REDUCTION_BARRIER_PATTERN=hard,hard OMP_PLACES=threads OMP_PROC_BIND=close" ninja check-openmp
with patch: https://reviews.llvm.org/D122645 . Due to the hardware resource limit, tests cannot be run in parallel)
- A64FX's hardware barrier is described in the following manual: https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Specification_HPC_Extension_v1_EN.pdf
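
To make the per-team decision described above easier to discuss, here is a minimal sketch in C of how a team could be classified. It is an illustrative model only, not code from this patch: the names barrier_plan, cores_are_consecutive, group_of_core, count_groups and plan_barrier are hypothetical, and it assumes that cores are numbered so that group g owns a contiguous range of cores_per_group cores. It also assumes that a team whose threads are not on consecutive cores falls back to the software barrier.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model only; none of these names exist in libomp. */
enum barrier_plan {
  PLAN_SOFTWARE,  /* team cannot use the hard barrier */
  PLAN_HARD_ONLY, /* all threads sit in one group: intra-group hard barrier */
  PLAN_HYBRID     /* hard barrier inside each group, software across groups */
};

/* True if thread i+1 runs on the core right after thread i's core. */
static bool cores_are_consecutive(const int *core_of_thread, int nthreads) {
  for (int i = 1; i < nthreads; ++i)
    if (core_of_thread[i] != core_of_thread[i - 1] + 1)
      return false;
  return true;
}

/* Assumption: group g owns cores [g * cores_per_group, (g + 1) * cores_per_group). */
static int group_of_core(int core, int cores_per_group) {
  assert(core >= 0 && cores_per_group > 0);
  return core / cores_per_group;
}

/* Number of groups spanned by a team placed on consecutive cores. */
static int count_groups(const int *core_of_thread, int nthreads,
                        int cores_per_group) {
  assert(nthreads > 0);
  int first = group_of_core(core_of_thread[0], cores_per_group);
  int last = group_of_core(core_of_thread[nthreads - 1], cores_per_group);
  return last - first + 1;
}

/* Classify one team: software fallback, hard only, or hybrid. */
static enum barrier_plan plan_barrier(const int *core_of_thread, int nthreads,
                                      int cores_per_group) {
  if (nthreads <= 0 || !cores_are_consecutive(core_of_thread, nthreads))
    return PLAN_SOFTWARE;
  if (count_groups(core_of_thread, nthreads, cores_per_group) == 1)
    return PLAN_HARD_ONLY;
  return PLAN_HYBRID;
}
```

For example, with an assumed group size of 12, a team on cores 0 to 3 would be classified as PLAN_HARD_ONLY, a team on cores 10 to 13 as PLAN_HYBRID because it crosses a group boundary, and a team on cores 0, 2, 4 as PLAN_SOFTWARE. The actual group size and core numbering depend on the system and on how the driver exposes them.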