This is an archive of the discontinued LLVM Phabricator instance.

Add interface to compute number of physical cores on host system
ClosedPublic

Authored by tejohnson on Oct 13 2016, 8:24 AM.

Details

Summary

For now I have only added support for x86_64 Linux, but other systems
can be added incrementally.

This is to be used for setting the default parallelism for ThinLTO
backends (instead of thread::hardware_concurrency, which includes
hyperthreading and is too aggressive). I'll send this as a follow-on
patch, and it will fall back to hardware_concurrency when the new
getHostNumPhysicalCores returns -1 (when not supported for a given
host system).
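
(For illustration, a rough sketch of the follow-on usage described above; getDefaultThinLTOParallelism is a hypothetical name, not code from this patch or the follow-on.)

#include "llvm/Support/Host.h"
#include "llvm/Support/thread.h"

// Hypothetical sketch: prefer the physical core count for ThinLTO backend
// parallelism, and fall back to hardware_concurrency() when
// getHostNumPhysicalCores() reports -1 (unsupported host).
static unsigned getDefaultThinLTOParallelism() {
  int PhysicalCores = llvm::sys::getHostNumPhysicalCores();
  if (PhysicalCores > 0)
    return static_cast<unsigned>(PhysicalCores);
  return llvm::thread::hardware_concurrency();
}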

I also added an interface to MemoryBuffer to force reading a file
as a stream - this is required for /proc/cpuinfo, which is a special
file that looks like a normal file but appears to have 0 size.
The existing readers of this file in Host.cpp are reading the first
1024 or so bytes from it, because the necessary info is near the top.
But for the new functionality we need to be able to read the entire
file. I can go back and change the other readers to use the new
getFileAsStream as a follow-on patch since it seems much more robust.
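
(For illustration, a minimal sketch of counting physical cores from /proc/cpuinfo via the new MemoryBuffer::getFileAsStream; the exact field handling here is an assumption for illustration, not the code in this patch.)

#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/MemoryBuffer.h"
#include <set>
#include <utility>

// Minimal sketch: /proc/cpuinfo reports a size of 0, so it has to be read
// as a stream; each physical core shows up as a unique
// (physical id, core id) pair across the per-CPU stanzas.
static int countPhysicalCoresFromProcCpuinfo() {
  auto BufOrErr = llvm::MemoryBuffer::getFileAsStream("/proc/cpuinfo");
  if (!BufOrErr)
    return -1;
  llvm::SmallVector<llvm::StringRef, 256> Lines;
  (*BufOrErr)->getBuffer().split(Lines, '\n');
  std::set<std::pair<int, int>> Cores;
  int PhysicalId = -1;
  for (llvm::StringRef Line : Lines) {
    auto KeyValue = Line.split(':');
    llvm::StringRef Key = KeyValue.first.trim();
    llvm::StringRef Value = KeyValue.second.trim();
    if (Key == "physical id")
      Value.getAsInteger(10, PhysicalId);
    else if (Key == "core id") {
      int CoreId = -1;
      if (!Value.getAsInteger(10, CoreId)) // getAsInteger returns true on error
        Cores.insert({PhysicalId, CoreId});
    }
  }
  return Cores.empty() ? -1 : static_cast<int>(Cores.size());
}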

Added a unittest.

Event Timeline

tejohnson updated this revision to Diff 74526. Oct 13 2016, 8:24 AM
tejohnson retitled this revision from to Add interface to compute number of physical cores on host system.
tejohnson updated this object.
tejohnson added a reviewer: mehdi_amini.
tejohnson added a subscriber: llvm-commits.
mehdi_amini added inline comments. Oct 13 2016, 9:27 AM
lib/Support/Host.cpp
1197

I think we should read the file once:

static int computeHostNumPhysicalCores() {
  // all your stuff
}
int sys::getHostNumPhysicalCores() {
  static int NumCores = computeHostNumPhysicalCores();
  return NumCores;
}
1212

Not that it matters that much, but have you considered reading the file line by line instead of reading it as a whole, splitting the lines, and then processing them in memory?
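
(For comparison, a sketch of what a line-by-line reader could look like; plain C++ streams are used purely for illustration and are not how LLVM's Support library would do it.)

#include <fstream>
#include <string>

// Illustration only: stream /proc/cpuinfo one line at a time rather than
// loading it whole and splitting in memory.
static void forEachCpuinfoLine() {
  std::ifstream In("/proc/cpuinfo");
  std::string Line;
  while (std::getline(In, Line)) {
    // Inspect Line here (e.g. look for "physical id" / "core id").
  }
}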

unittests/Support/Host.cpp
30

May want to check the tuple <OS, arch>, but that can be changed when it's needed.

tejohnson added inline comments. Oct 13 2016, 10:09 AM
lib/Support/Host.cpp
1197

Good idea, done.

1212

I originally thought about that when I saw that Host.cpp already rolled its own support for reading in (part of) /proc/cpuinfo that didn't meet my needs. But I felt that using the existing memory buffer streaming support (already used if we detect this isn't a regular file, or for stdin) was cleaner and simpler.

unittests/Support/Host.cpp
30

Good idea - done. This also made me realize it should be an AND condition not an OR!
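
(A rough sketch of the kind of check being discussed; the test name is hypothetical and the exact committed test may differ.)

#include "llvm/ADT/Triple.h"
#include "llvm/Support/Host.h"
#include "gtest/gtest.h"

// Sketch of the discussed check: only x86_64 *and* Linux (AND, not OR) is
// expected to report a positive physical core count for now; other hosts
// return -1 per the summary above.
TEST(HostTest, NumPhysicalCoresSupported) {
  llvm::Triple Host(llvm::sys::getProcessTriple());
  int Cores = llvm::sys::getHostNumPhysicalCores();
  if (Host.isOSLinux() && Host.getArch() == llvm::Triple::x86_64)
    EXPECT_GT(Cores, 0);
  else
    EXPECT_EQ(Cores, -1);
}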

tejohnson updated this revision to Diff 74537. Oct 13 2016, 10:09 AM

Address review comments

mehdi_amini accepted this revision. Oct 13 2016, 10:24 AM
mehdi_amini edited edge metadata.

LGTM.

unittests/Support/Host.cpp
30

Yes that's what I had in mind :)

This revision is now accepted and ready to land. Oct 13 2016, 10:24 AM
aaron.ballman added inline comments.
lib/Support/Host.cpp
1239

No Windows implementation? Should be able to use GetLogicalProcessorInformation() pretty easily.

https://msdn.microsoft.com/en-us/library/windows/desktop/ms683194(v=vs.85).aspx
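
(A hedged sketch of what such a Windows path could look like, counting RelationProcessorCore entries; this is not part of this patch.)

#include <windows.h>
#include <vector>

// Sketch only: each RelationProcessorCore entry returned by
// GetLogicalProcessorInformation corresponds to one physical core.
static int computeHostNumPhysicalCoresWindows() {
  DWORD Size = 0;
  GetLogicalProcessorInformation(nullptr, &Size);
  if (GetLastError() != ERROR_INSUFFICIENT_BUFFER)
    return -1;
  std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> Info(
      Size / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
  if (!GetLogicalProcessorInformation(Info.data(), &Size))
    return -1;
  int Cores = 0;
  for (const auto &Entry : Info)
    if (Entry.Relationship == RelationProcessorCore)
      ++Cores;
  return Cores;
}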

mehdi_amini added inline comments. Oct 13 2016, 10:30 AM
lib/Support/Host.cpp
1239

Patch welcome :)

I'll supply the MacOS version when possible.
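
(For reference, a sketch of the usual macOS approach via the hw.physicalcpu sysctl; this is an assumption about what such a patch could do, not Mehdi's actual change.)

#include <cstdint>
#include <sys/sysctl.h>
#include <sys/types.h>

// Sketch only: query the physical core count through sysctlbyname.
static int computeHostNumPhysicalCoresDarwin() {
  int32_t Count = 0;
  size_t Len = sizeof(Count);
  if (sysctlbyname("hw.physicalcpu", &Count, &Len, nullptr, 0) != 0)
    return -1;
  return Count;
}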

tejohnson added inline comments. Oct 13 2016, 10:36 AM
lib/Support/Host.cpp
1239

I'd prefer someone who has access to Windows for pre-commit testing to contribute a patch. I'd have to guess and just wait for bots to test. (Mehdi had already committed to supplying the Mac OS side patch.)

This revision was automatically updated to reflect the committed changes.

Why ask specifically about physical cores? Should this instead ask about the preferred number of concurrent tasks?

Why ask specifically about physical cores? Should this instead ask about the preferred number of concurrent tasks?

The reason is that if you have, for instance, very memory-heavy tasks, and you know (by experiment) that you don't benefit much from hyper-threading in terms of runtime, it seems friendlier to kick off fewer threads (dividing the memory peak by 2).

Why ask specifically about physical cores? Should this instead ask about the preferred number of concurrent tasks?

Note also that we already have llvm::thread::hardware_concurrency(), which returns the number of "concurrent tasks" (i.e. virtual cores on machines with hyper-threading).

The reason is that if you have, for instance, very memory-heavy tasks, and you know (by experiment) that you don't benefit much from hyper-threading in terms of runtime, it seems friendlier to kick off fewer threads (dividing the memory peak by 2).

Then what's stopping us from returning that number as "preferred number of concurrent tasks"? Instead we have something motivated by x86 that may or may not work well on other architectures.

The reason is that if you have, for instance, very memory-heavy tasks, and you know (by experiment) that you don't benefit much from hyper-threading in terms of runtime, it seems friendlier to kick off fewer threads (dividing the memory peak by 2).

Then what's stopping us from returning that number as "preferred number of concurrent tasks"?

Do you mean having llvm::thread::hardware_concurrency() return the "preferred number of concurrent tasks"?
This does not seem like a good idea to me, because there was an important "if" in the explanation above: "if you have, for instance, very memory-heavy tasks, and you know (by experiment) that you don't benefit much from hyper-threading in terms of runtime".

Instead we have something motivated by x86 that may or may not work well on other architectures.

I'm not sure how to solve this other than by delegating such logic to the client of this API, i.e. I see this API as a building block for other APIs.
This other API could be, for instance, llvm::thread::hardware_coarse_concurrency(). This is where I would handle architecture-specific choices (and default to hardware_concurrency).

I'm not sure how to solve this other than by delegating such logic to the client of this API, i.e. I see this API as a building block for other APIs.
This other API could be, for instance, llvm::thread::hardware_coarse_concurrency(). This is where I would handle architecture-specific choices (and default to hardware_concurrency).

The host will most often be x86, so this isn't really that big a deal. It just seemed like a cleaner idea to query the OS about how many concurrent tasks it could handle at the moment; that could transparently take into account the current load on the machine, if someone wanted to go that far.
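
(A speculative sketch of what such a load-aware query could look like, using the POSIX getloadavg(); purely illustrative of the idea above, not something proposed in this review.)

#include <cstdlib> // getloadavg (POSIX/BSD)
#include <thread>

// Sketch only: derive a "tasks the machine can take right now" estimate by
// subtracting the 1-minute load average from the hardware thread count.
static unsigned currentlyPreferredConcurrency() {
  unsigned HW = std::thread::hardware_concurrency();
  if (HW == 0)
    HW = 1; // hardware_concurrency() may report 0 when unknown
  double Load[1] = {0.0};
  if (getloadavg(Load, 1) != 1)
    return HW; // load average unavailable; fall back to the raw HW count
  double Free = static_cast<double>(HW) - Load[0];
  return Free > 1.0 ? static_cast<unsigned>(Free) : 1u;
}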