This is an archive of the discontinued LLVM Phabricator instance.

RFC: [openmp] Don't assume a specific layout for alloca() in Windows arm64 __kmp_invoke_microtask
AbandonedPublic

Authored by mstorsjo on Nov 11 2022, 4:13 AM.

Details

Summary

The alloca + memcpy could be optimized out entirely by Clang,
since it wasn't referenced by anything outside of the function,
and had no visible effect.

And even if it wouldn't be optimized out, there's strictly no
guarantee that the buffer allocated by alloca is at the exact
stack frame boundary when calling the next function; the compiler
could easily have added extra padding around the allocation at the
bottom.

Change to using the same full switch as the fallback in z_Linux_util.cpp,
which supports up to 15 arguments. This fixes calls to microtasks with
more than 6 parameters in optimized builds with Clang. However, for
the build configurations where the alloca wasn't broken already, this
change does break the misc_bugs/many-microtask-args.c test - which
requires passing 17 arguments to the microtask.

This tries to fix the same issue as D137827 does, but fixing it in
pure C code. However this fix only works for a set, fixed maximum
number of arguments.

The same implementation is being used as fallback implementation in
z_Linux_util.cpp for architectures without an assembly implementation

  • where it also fails the misc_bugs/many-microtask-args.c test.

Diff Detail

Event Timeline

mstorsjo created this revision.Nov 11 2022, 4:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 11 2022, 4:13 AM
mstorsjo requested review of this revision.Nov 11 2022, 4:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 11 2022, 4:13 AM
Herald added a subscriber: sstefan1. · View Herald Transcript
pulidocr added inline comments.Nov 11 2022, 9:02 AM
openmp/runtime/src/z_Windows_NT-586_util.cpp
152

Is there a fallback path for ARM64 MSVC? If there isn't, then our libomp140.aarch64.dll might break. @natgla FYI

natgla added inline comments.Nov 11 2022, 9:11 AM
openmp/runtime/src/z_Windows_NT-586_util.cpp
174

argbuf better be volatile void*, to prevent optimizer messing with it

mstorsjo added inline comments.Nov 11 2022, 9:17 AM
openmp/runtime/src/z_Windows_NT-586_util.cpp
152

No, there's no fallback path to that. So yes, this would break your libomp builds for the cases if passing more than 15 arguments here.

But the main point is that the alloca + memcpy construct is extremely brittle here - I wouldn't even be sure that an update of MSVC could break it (since if you look at it from a compiler point of view, the memcpy is a dead write to a buffer which is not referenced anywhere).

As far as I know, the only properly robust solution, for an arbitrary number of arguments, is to write this function in assembly (like I do with gnu assembly in D137827). I guess it's possible to port the existing aarch assembly function to MS armasm64 syntax too (but I don't know if cmake knows how to interact with that tool).

174

Yes, possibly. With Clang I managed to make it not optimize it out, by adding e.g. __asm__ __ volatile__("" ::"r"(argbuf)), which passes the buffer pointer to an opaque inline assembly snippet (so the compiler can't assume anything about it) which does nothing.

But even then, there's no strict guarantee that the buffer actually is at the exact bottom of the stack frame - there could concievably be some extra space at the bottom of the frame (e.g. if the compiler is saving space for outgoing parameters, if there is a function call with more than 8 parameters somewhere).

FWIW, We should land the patch that emits all arguments in a single struct. Then deprecate the varargs function. It is not helpful and causes too many problems to be kept around.

mstorsjo added inline comments.Nov 14 2022, 5:13 AM
openmp/runtime/src/z_Windows_NT-586_util.cpp
174

FWIW, changing the buffer to volatile void * doesn't work as such here:

../runtime/src/z_Windows_NT-586_util.cpp:173:5: error: no matching function for call to 'memcpy'
    memcpy(argbuf, &p_argv[6], len);
    ^~~~~~
/home/martin/clang-nightly/aarch64-w64-mingw32/include/wchar.h:1476:17: note: candidate function not viable: 1st argument ('volatile void *') would lose volatile qualifier
  void *__cdecl memcpy(void * __restrict__ _Dst,const void * __restrict__ _Src,size_t _MaxCount) __MINGW_ATTRIB_DEPRECATED_SEC_WARN;
                ^

(And if casting the volatile pointer back to void*, there's no effect from that.)

natgla added inline comments.Nov 15 2022, 9:51 AM
openmp/runtime/src/z_Windows_NT-586_util.cpp
152

@mstorsjo,
For now please keep the current implementation for MSVC - or re-write it in assembler. Yes, it's brittle, it will be great to switch to an array of arguments instead of using varargs, but we don't have resources now. Currently it's working. It may break if there will be changes in __kmp_invoke_microtask or if MSVC compiler will get better with optimizations. We have tests to ensure we won't break it in Microsoft, and if we'll do, we'll have to re-write this function. And we plan to add our testing bot to LLVM.

@mstorsjo, sorry, didn't ask you earlier: what is your environment? Our combination: everything is on Windows; we are using MSVC compiler to compile OpenMP code (for x64, x86, or arm64), and using libomp140 for runtime. LLVM libomp runtime is compiled by MSVC (for x64, x86, arm64) with additional definition 'OPENMP_MSVC_NAME_SCHEME' for cmake. We also have LLVM_ENABLE_PDB=ON for cmake for release, AFAIK it's different from usual.

@mstorsjo, sorry, didn't ask you earlier: what is your environment? Our combination: everything is on Windows; we are using MSVC compiler to compile OpenMP code (for x64, x86, or arm64), and using libomp140 for runtime. LLVM libomp runtime is compiled by MSVC (for x64, x86, arm64) with additional definition 'OPENMP_MSVC_NAME_SCHEME' for cmake. We also have LLVM_ENABLE_PDB=ON for cmake for release, AFAIK it's different from usual.

I build libomp with Clang, for i686, x86_64 and aarch64 (in mingw mode, but I test some aspects occasionally with building with clang in MSVC mode too).

I then run the openmp lit testsuite on Windows on arm64 (and on x86_64 too).

openmp/runtime/src/z_Windows_NT-586_util.cpp
152

Ok, I’ll abandon this patch then. D137827 replaces this function with an assembler implementation - but the implementation is gnu assembly, so it works with clang, but not with MSVC. (That patch keeps using this codepath for MSVC.)

mstorsjo abandoned this revision.Nov 16 2022, 12:39 AM

Abandoning this change for now.

Currently, microtasks with more than 6 arguments are broken when libomp is built with Clang for windows/arm64, while they work correctly (for any arbitrary number of arguments) when built with MSVC for the same configuration.

This patch improved things for Clang (for up to 15 arguments), while it also would have regressed the MSVC configuration down to only working for up to 15 arguments.

For Clang, D137827 fixes the issue for any arbitrary number of arguments, by providing the same function implemented in gnu assembly.