This is an archive of the discontinued LLVM Phabricator instance.

[Support] On Windows 11, fix an affinity mask issue on large core count machines
ClosedPublic

Authored by aganea on Nov 26 2022, 11:52 AM.

Details

Summary

Recent Windows 11 and Windows Server 2022 changed the way they assign 'processor groups' to a starting PE. Before Windows 11 and Windows Server2022, only one processor group was assigned by default, then the program was responsible for dispatching its own threads on more 'processor groups'. That is what D71775 was doing, allowing LLVM programs use all threads on many cores machines.

After Windows 11 and Windows Server 2022, the OS takes care of that. This has an adverse effect reported in PR56618 which is that using ::GetProcessAffinityMask() API in some edge cases seems buggy now. That API was used to detect if an affinity mask was set, and adjust accordingly the available threads for a ThreadPool.

From the looks of it this seems to be a TOCTOU bug, the OS assigns a default affinity mask on the process group where the PE is started, however later the PE's main() thread runs on a different-sized process group. On Windows, the max size for an affinity mask is 64. In our case, when running on a n2d-highcpu-224 GCE instance, we're seeing 4 processor groups, 2 of size 64 and 2 others of size 48, which makes a total 224 vCPUs. The Windows OS randomly assigns a starting process mask of either (2^64)-1 or (2^48)-1 bits. In some edge cases, the main thread calling ::GetProcessAffinityMask() randomly runs on a different process group, thus making hard for a program to determine if a custom affinity mask was set or not. This wasn't happening before Windows 11, since only a single 'processor group' was used on PE startup, even on machines with asymmetric processor groups.

With this patch, on one hand, on Windows 11 & Windows Server 2022 we disable manual dispatching of threads on processor groups, and instead let the Windows OS do that. On the other hand, a workaround was added to mitigate the ::GetProcessAffinityMask() issue described above (see Threading.inc, L226).

Fixes PR56618.

Diff Detail

Event Timeline

aganea created this revision.Nov 26 2022, 11:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 26 2022, 11:52 AM
aganea requested review of this revision.Nov 26 2022, 11:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 26 2022, 11:52 AM
tschuett added inline comments.
llvm/lib/Support/Windows/Process.inc
480–484

You can use std::option<RTL_OSVERSIONINFOEXW>.

502

This looks like cute way to model nullopt. Would returning std::option<lvm::VersionTuple> be an improvement?

aganea updated this revision to Diff 478107.Nov 27 2022, 9:06 AM
aganea marked 2 inline comments as done.

As suggested by @tschuett.

llvm/lib/Support/Windows/Process.inc
502

I made the changes so you can see how it looks from the caller perspective. I think std::option internally in Process.inc makes sense. However externally, returning std::option from llvm::GetWindowsOSVersion() leaks the fact that the Win32 APIs can fail, which clients don't really care about, and requires a bit more client-side code. Semantically clients only need to compare < or > against a version number, and 0 is fine. Please let me know your thoughts.

tschuett added a comment.EditedNov 27 2022, 9:56 AM

It is completely up to you.

But it is kind of dangerous to return 0.

Thinking about this again. Let's say you are running on victious Windows 5000 machine, fail to detect the version, and return 0. All version comparisons are bogus. I would prefer to return an optional to communicate that this API is fallible.

rnk resigned from this revision.Nov 28 2022, 10:17 AM

I'm preparing for leave and don't have much time to review, so I'm not likely to get to this on a helpful timeline.

aaron.ballman added inline comments.
llvm/lib/Support/Windows/Process.inc
485–504

RtlGetVersion came with Windows 2000, which is older than the oldest version of Windows we support. I think we can safely assert that this interface is loaded.

491–493

According to MSDN, this API only returns STATUS_SUCCESS (https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-rtlgetversion), so I wonder if we can get away with an assert here as well, and then we don't have to worry about this interface ever resulting in no valid version information, which can propagate to the callers.

502

I think std::option internally in Process.inc makes sense.

Agreed, I think it's an improvement.

However externally, returning std::option from llvm::GetWindowsOSVersion() leaks the fact that the Win32 APIs can fail, which clients don't really care about, and requires a bit more client-side code. Semantically clients only need to compare < or > against a version number, and 0 is fine. Please let me know your thoughts.

I agree; the very first thing I did on this review was add a comment to that declaration asking for comments explaining when the result might not exist and how callers should react in that case. I'm not 100% sure we can remove the optionality from the return type, but I found some hints that suggest we can explore it.

Our local testing with this patch shows that it solves our issue. Thanks very much! Looking forward to it landing.

aganea updated this revision to Diff 479845.Dec 3 2022, 8:32 AM
aganea marked 3 inline comments as done.

Address comments.

thieta added a comment.Dec 6 2022, 8:36 AM

Is there any point in adding any tests here that can be useful? Probably not since we are working with the Windows API right? I don't have much to comment on, but it seems reasonable if it solves the problem.

Is there any point in adding any tests here that can be useful? Probably not since we are working with the Windows API right? I don't have much to comment on, but it seems reasonable if it solves the problem.

The best test is the usage of the ThreadPool. However: 1. it needs a significant enough CPU load to test all sockets (as in linking clang.exe). 2. it requires a VM with WinServer 2022 and at least two or more sockets/groups. and 3. even in that case, I don't have a good way to check that all processors (as in cores) are fully utilized. The OS is now free to do what it pleases to distribute the CPU load. Unless someone has something to suggest, I don't see how to test cover this change.

mehdi_amini resigned from this revision.Dec 19 2022, 6:18 AM

Thanks for the detailed description, and kudos for figuring this out.
I don’t feel qualified enough about windows to review this right now, it seems fine to me from what I gathered though.
Since this is a scheduling change I think it is fine to not have unit tests for this.

Nit: you have a typo in the description “one one hand”

aganea edited the summary of this revision. (Show Details)Dec 19 2022, 7:12 AM

+other Windows experts that perhaps can review this @hans @zero9178 @aaron.ballman @mstorsjo @aeubanks

aganea edited the summary of this revision. (Show Details)Dec 22 2022, 5:46 AM
aaron.ballman accepted this revision.Jan 4 2023, 11:07 AM

LGTM aside from a minor coding style issue; you should also add a release note about the bug fix. Thank you!

llvm/lib/Support/Windows/Threading.inc
164–176

Minor nits for coding style.

This revision is now accepted and ready to land.Jan 4 2023, 11:07 AM
This revision was automatically updated to reflect the committed changes.
aganea marked an inline comment as done.
aganea added a comment.Jan 7 2023, 8:49 AM

Thanks for reviewing @aaron.ballman! I applied the changes and updated the release notes in the commit.