Page MenuHomePhabricator

Make balanced affinity work on AArch64 (and possibly other architectures too)

Authored by pawosm01 on Jul 14 2016, 9:34 AM.



This patch enables balanced affinity on machines that do not have
hardware threads and/or do have cores clustered into packages. In facts,
balacing algorithm could be generalized for any arrangement with
at least two levels of hierarchy (depth > 1).

Diff Detail

Event Timeline

pawosm01 updated this revision to Diff 63992.Jul 14 2016, 9:34 AM
pawosm01 retitled this revision from to Make balanced affinity work on AArch64 (and possibly other architectures too).
pawosm01 updated this object.
pawosm01 added reviewers: jlpeyton, AndreyChurbanov.
pawosm01 set the repository for this revision to rL LLVM.
pawosm01 added a subscriber: openmp-commits.
jlpeyton edited edge metadata.Jul 14 2016, 9:40 AM

Can you re-upload the patch with all context?

pawosm01 updated this revision to Diff 63996.Jul 14 2016, 9:47 AM
pawosm01 edited edge metadata.
AndreyChurbanov requested changes to this revision.Jul 21 2016, 9:23 AM
AndreyChurbanov edited edge metadata.

The patch as it is does not work for the case of multi package non-uniform topology.

So this case should be either fixed or disabled.

3994 ↗(On Diff #65319)

This is number of cores in the last package, as the code supposed to be run on single package topology only.

3995 ↗(On Diff #65319)

This code does not take into account number of packages (supposed to be run on single-package topology).

4017 ↗(On Diff #65319)

Here the memory gets overwritten when the last package has lesser number of cores than max available, because (core*nth_per_core+thread) can be bigger than nproc so that the write is out of array bound.

4655 ↗(On Diff #65319)

This is the number of cores in the last package. Other packages are skipped (no threads bound there).

This revision now requires changes to proceed.Jul 21 2016, 9:23 AM

Seems like we wrote our comments at the same time. I'll look at the issues raised.

pawosm01 added a comment.EditedJul 21 2016, 10:23 AM

There are ARMv8 boards where cores of one SoC are clustered into two packages: one for LITTLE cores and one for big. With original limitations this affinity method refused to work (due to nPackages > 1) which is bit discriminatory.

Note that I tested both uniform (equal number of cores per clusters) and non-uniform (more LITTLE cores than big) arrangements.

What I couldn't test (but I may gain access to such x86_64 machine in the future) is multi-package hyper-threaded machine (more than one CPU on board, each with more than one core of which each has more than one hardware thread) - I need to check again what effect my chagnes have on such arrangement (e.g. possible overwrites in procarr array).

3994 ↗(On Diff #65319)

The whole point of change in line 3967 is to lift the limitation that the code is supposed to be run on single package topology only.

3995 ↗(On Diff #65319)

In case of nPackages > 1 and nCoresPerPkg, nth_per_core equals to nCoresPerPkg.
I did not change the algorithm, I didn't change names of the variables used either.

4655 ↗(On Diff #65319)

Do you refer to the effect this has on line 4679? This code crashes when __kmp_ncores is used there instead.

Maybe I was not clear enough. The code works fine for multi package non-uniform if no hyperthreads available, but works incorrectly for multi package non-uniform with hyperthreads. So the simplest fix would be disable balanced affinity for this case.

I haven't finished debugging no-HT case yet, I expect there should not be any problems, because the output of verbose affinity looks fine there.
Just tried to flag the found problem early...

Still, I need to fix it as I plan to have machine arranged like that.

4655 ↗(On Diff #65319)

My comment was probably irrelevant here. The problem was incorrect filling of the procarr array that only contained info on the last package. So if this array is in use (I mostly tested case on line 4710), then all threads were bound to this last package.

If you fix memory overwriting at initialization, then anyway ncores will need to be multiplied by number of packages here or extra loop over packages needs to be added, I think. Otherwise l don't see how all machine can be covered, given that loops for( int i = 0; i < ncores; i++ ) only walk through subset of all cores now.

BTW, I finished with debugging of no-HT case. It works fine because you effectively reduced this case to single package via shifting topology one level down and making the balanced affinity to see packages as cores and cores as threads. This trick does not work for HT enabled topology.

pawosm01 updated this revision to Diff 65319.Jul 25 2016, 3:00 AM
pawosm01 edited edge metadata.

I prepared and made use of three helper functions to make dealing with any case of non-uniform arrangement easier. I did not touch uniform topology code as it does not seem to be affected by existence of package layer: for HT case there are always the two deepest levels in play anyway. I wonder if we need this uniform - non-uniform distinction at all, uniform topology is just a special case of non-uniform topology and my new approach is capable to deal with both.

Actually, these are four helper functions, I didn't notice the smallest which only calls the other one.

AndreyChurbanov accepted this revision.Jul 29 2016, 10:32 AM
AndreyChurbanov edited edge metadata.


The code may still degrade to affinity none in some exotic cases of non-uniform topology (when address2os array has info on cores with hyperthreads numbered not from 0), but the value of code is much bigger than the need to support balanced affinity for such cases.

This revision is now accepted and ready to land.Jul 29 2016, 10:32 AM
This revision was automatically updated to reflect the committed changes.