This patch enables balanced affinity on machines that do not have
hardware threads and/or do have cores clustered into packages. In facts,
balacing algorithm could be generalized for any arrangement with
at least two levels of hierarchy (depth > 1).
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
The patch as it is does not work for the case of multi package non-uniform topology.
So this case should be either fixed or disabled.
runtime/src/kmp_affinity.cpp | ||
---|---|---|
3994 ↗ | (On Diff #63996) | This is number of cores in the last package, as the code supposed to be run on single package topology only. |
3995 ↗ | (On Diff #63996) | This code does not take into account number of packages (supposed to be run on single-package topology). |
4017 ↗ | (On Diff #63996) | Here the memory gets overwritten when the last package has lesser number of cores than max available, because (core*nth_per_core+thread) can be bigger than nproc so that the write is out of array bound. |
4655 ↗ | (On Diff #63996) | This is the number of cores in the last package. Other packages are skipped (no threads bound there). |
There are ARMv8 boards where cores of one SoC are clustered into two packages: one for LITTLE cores and one for big. With original limitations this affinity method refused to work (due to nPackages > 1) which is bit discriminatory.
Note that I tested both uniform (equal number of cores per clusters) and non-uniform (more LITTLE cores than big) arrangements.
What I couldn't test (but I may gain access to such x86_64 machine in the future) is multi-package hyper-threaded machine (more than one CPU on board, each with more than one core of which each has more than one hardware thread) - I need to check again what effect my chagnes have on such arrangement (e.g. possible overwrites in procarr array).
runtime/src/kmp_affinity.cpp | ||
---|---|---|
3994 ↗ | (On Diff #63996) | The whole point of change in line 3967 is to lift the limitation that the code is supposed to be run on single package topology only. |
3995 ↗ | (On Diff #63996) | In case of nPackages > 1 and nCoresPerPkg, nth_per_core equals to nCoresPerPkg. |
4655 ↗ | (On Diff #63996) | Do you refer to the effect this has on line 4679? This code crashes when __kmp_ncores is used there instead. |
Maybe I was not clear enough. The code works fine for multi package non-uniform if no hyperthreads available, but works incorrectly for multi package non-uniform with hyperthreads. So the simplest fix would be disable balanced affinity for this case.
I haven't finished debugging no-HT case yet, I expect there should not be any problems, because the output of verbose affinity looks fine there.
Just tried to flag the found problem early...
runtime/src/kmp_affinity.cpp | ||
---|---|---|
4655 ↗ | (On Diff #63996) | My comment was probably irrelevant here. The problem was incorrect filling of the procarr array that only contained info on the last package. So if this array is in use (I mostly tested case on line 4710), then all threads were bound to this last package. If you fix memory overwriting at initialization, then anyway ncores will need to be multiplied by number of packages here or extra loop over packages needs to be added, I think. Otherwise l don't see how all machine can be covered, given that loops for( int i = 0; i < ncores; i++ ) only walk through subset of all cores now. BTW, I finished with debugging of no-HT case. It works fine because you effectively reduced this case to single package via shifting topology one level down and making the balanced affinity to see packages as cores and cores as threads. This trick does not work for HT enabled topology. |
I prepared and made use of three helper functions to make dealing with any case of non-uniform arrangement easier. I did not touch uniform topology code as it does not seem to be affected by existence of package layer: for HT case there are always the two deepest levels in play anyway. I wonder if we need this uniform - non-uniform distinction at all, uniform topology is just a special case of non-uniform topology and my new approach is capable to deal with both.
Actually, these are four helper functions, I didn't notice the smallest which only calls the other one.
LGTM
The code may still degrade to affinity none in some exotic cases of non-uniform topology (when address2os array has info on cores with hyperthreads numbered not from 0), but the value of code is much bigger than the need to support balanced affinity for such cases.