Page MenuHomePhabricator

[cmake] Don't build with -O3 -fPIC on Solaris/sparcv9
ClosedPublic

Authored by ro on Aug 10 2020, 12:40 AM.

Details

Summary

Tests on Solaris/sparcv9 currently show about 250 failures, most of them like the following:

FAIL: LLVM-Unit :: Support/./SupportTests/TaskQueueTest.UnOrderedFutures (4269 of 67884)
******************** TEST 'LLVM-Unit :: Support/./SupportTests/TaskQueueTest.UnOrderedFutures' FAILED ********************
Note: Google Test filter = TaskQueueTest.UnOrderedFutures
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TaskQueueTest
[ RUN      ] TaskQueueTest.UnOrderedFutures
0  SupportTests        0x0000000100753b20 llvm::sys::PrintStackTrace(llvm::raw_ostream&) + 32
1  SupportTests        0x0000000100752974 llvm::sys::RunSignalHandlers() + 68
2  SupportTests        0x0000000100752b18 SignalHandler(int) + 372
3  libc.so.1           0xffffffff7eedc800 __sighndlr + 12
4  libc.so.1           0xffffffff7eecf23c call_user_handler + 852
5  libc.so.1           0xffffffff7eecf594 sigacthandler + 84
6  SupportTests        0x00000001006f8cb8 std::thread::_State_impl<std::thread::_Invoker<std::tuple<llvm::ThreadPool::ThreadPool(llvm::ThreadPoolStrategy)::'lambda'()> > >::_M_run() + 512
7  libstdc++.so.6.0.28 0xfffffffc628117cc execute_native_thread_routine + 16
8  libc.so.1           0xffffffff7eedc6a0 _lwp_start + 0

Since it's effectively impossible to debug such a SEGV in a Release build, I tried a Debug build instead, only to find that the failures had gone away.

Further investigation revealed that most of the issue centers around llvm/lib/Support/ThreadPool.cpp. That file is built with -O3 -fPIC in a Release build. The failure vanishes if

  • compiling without -fPIC
  • compiling with -O -fPIC
  • linking with GNU ld instead of Solaris ld

An attempt to build with -DLLVM_ENABLE_PIC=Off initially failed since neither libRemarks.so (D85626) nor LLVMPolly.so (D85627) heed that option. Even with that fixed, a few codegen failures remain.

Next I tried to build just ThreadPool.cpp with -O -fPIC. While that fixed the vast majority of the failures, 16 LLVM :: CodeGen/X86 failures remained.

Given that that solution was both incomplete and fragile, I went for building the whole tree with -O -fPIC for Release builds.

This is what this patch does, making all failures relative to a Debug build go away.

Tested on sparcv9-sun-solaris2.11.

Diff Detail

Event Timeline

ro created this revision.Aug 10 2020, 12:40 AM
Herald added a project: Restricted Project. · View Herald Transcript
ro requested review of this revision.Aug 10 2020, 12:40 AM
ro added a comment.Aug 17 2020, 1:19 AM

Ping? It's been a week, the patch fixes ca. 250 testsuite failures and is this a candidate for LLVM 11.0.0 rc2.

In D85630#2220694, @ro wrote:

Ping? It's been a week, the patch fixes ca. 250 testsuite failures and is this a candidate for LLVM 11.0.0 rc2.

You should file a bug and put 'release-11.0.0' in the blocker field to ensure this gets cherry-picked into the release/11.x branch.

Presumably this is a bug in something -- either the solaris linker, in g++, or in the llvm code being mis-compiled? It seems unfortunate to put in place a hacky workaround like this, without a bug reference to the responsible component.

hans added a comment.Aug 18 2020, 1:35 AM

I agree with James.

Also, since you're applying this work-around for gcc, I assume the tests pass in the bootstrapped build, i.e. when compiling with clang?

ro added a comment.Aug 18 2020, 2:21 AM

Presumably this is a bug in something -- either the solaris linker, in g++, or in the llvm code being mis-compiled? It seems unfortunate to put in place a hacky workaround like this, without a bug reference to the responsible component.

True. That's because at the time I posted this workaround/hack, the bug hadn't fully been analysed yet. This has changed in the meantime: it's been found that gcc doesn't correctly heed some TLS code sequences. To make things worse, Solaris ld doesn't properly validate its assumptions against the input, generating wrong code.

gld like gcc is more liberal here and correctly deals with the code it gets fed from gcc.

There's PR target/96607: GCC feeds SPARC/Solaris linker with unrecognized TLS sequences now.

ro added a comment.Aug 18 2020, 2:22 AM
In D85630#2220694, @ro wrote:

Ping? It's been a week, the patch fixes ca. 250 testsuite failures and is this a candidate for LLVM 11.0.0 rc2.

You should file a bug and put 'release-11.0.0' in the blocker field to ensure this gets cherry-picked into the release/11.x branch.

Will do. However, by the time I posted the patch it wasn't (and still fully isn't) clear that the patch is ok/acceptable even for master.

ro added a comment.Aug 18 2020, 2:29 AM

I agree with James.

Also, since you're applying this work-around for gcc, I assume the tests pass in the bootstrapped build, i.e. when compiling with clang?

I'd initially restricted the workaround to gcc because the reduced testcase attached to the GCC PR works when compiled with clang++ instead of g++. However, a bootstrap build is still nightmarishly bad on Solaris/sparcv9: the 11.0.0-rc1 one had 4817 failures, a fresh one on master with the workaround applied to both gcc and clang is down to 1365.

So there's at least some improvement, but still a long way to go.

ro updated this revision to Diff 286800.Aug 20 2020, 6:03 AM

Update comment.
Use -O with clang, too.

ro added a comment.Aug 20 2020, 6:15 AM
In D85630#2223313, @ro wrote:

I agree with James.

Also, since you're applying this work-around for gcc, I assume the tests pass in the bootstrapped build, i.e. when compiling with clang?

I'd initially restricted the workaround to gcc because the reduced testcase attached to the GCC PR works when compiled with clang++ instead of g++. However, a bootstrap build is still nightmarishly bad on Solaris/sparcv9: the 11.0.0-rc1 one had 4817 failures, a fresh one on master with the workaround applied to both gcc and clang is down to 1365.

So there's at least some improvement, but still a long way to go.

I've now clarified the gcc situation, adding a reference to the GCC PR for the issue.

I've also managed to reproduce the LLVM 11 2-stage builds on master: 4581 failures with just gcc -O (instead of -O3) vs. 1325 failures with the stage 2 clang built with -O, too.

I couldn't get the 2-stage builds to finish initially because this is a 128-strand box: I had restricted stage 1 to -j24, carefully restricting the lit parallelism using LLVM_LIT_ARGS=-v -j 24. Unfortunately, that isn't automatically passed on to stage 2 and a test run at -j128 totally killed the box, overwhelming even 256 GB RAM.

Unfortunately, I haven't yet made much progress investigating the failures in the 2-stage builds: a pure Release build is useless for that, and a Debug build fails in a different way. I'm trying a RelWithDebInfo right now.

The question is how to proceed, though: with the workaround at least a 1-stage build with gcc shows no errors, so this is at least a basis. Getting a clean 2-stage build will require quite a bit of effort, also in areas long reported/known, but so far ignored (no atomic support in 32-bit unlike gcc, wrong long double on 32-bit sparc).

It would be nice to get this workaround in for now, forming a basis for further work.

ro added a comment.Aug 20 2020, 7:59 AM
In D85630#2228328, @ro wrote:

Unfortunately, I haven't yet made much progress investigating the failures in the 2-stage builds: a pure Release build is useless for that, and a Debug build fails in a different way. I'm trying a RelWithDebInfo right now.

For the record: just switching from a Release to a RelWithDebInfo build (using -O at stages 1 and 2) reduces the number of failures to 140. The vast majority of those are due to the lack of atomics support in 32-bit Sparc.

ro added a comment.Aug 25 2020, 1:08 AM
In D85630#2223304, @ro wrote:
In D85630#2220694, @ro wrote:

Ping? It's been a week, the patch fixes ca. 250 testsuite failures and is this a candidate for LLVM 11.0.0 rc2.

You should file a bug and put 'release-11.0.0' in the blocker field to ensure this gets cherry-picked into the release/11.x branch.

Will do. However, by the time I posted the patch it wasn't (and still fully isn't) clear that the patch is ok/acceptable even for master.

I've now filed Bug 47304 for this.

hans accepted this revision.Aug 27 2020, 5:14 AM

Feels like a hack, but I'm not strongly opposed either.

This revision is now accepted and ready to land.Aug 27 2020, 5:14 AM
ro updated this revision to Diff 288350.Aug 27 2020, 8:40 AM

Handle both Release and RelWithDebInfo builds, downgrade from both -O3 and -O2 to
-O.

ro added a comment.Aug 27 2020, 8:43 AM

Feels like a hack, but I'm not strongly opposed either.

It certainly is. However, even if both gcc were to stop generating invalid TLS sequences today and ld accepted what current gcc generates (possible, but difficult since there's no spec what can be emitted), it would take time for both to become available/be included in a release.

Are you still ok with the revised version? It was prompted by a question in Bug 47304 which showed that I'd missed some cases before.

hans added a comment.Aug 28 2020, 2:25 AM

Are you still ok with the revised version? It was prompted by a question in Bug 47304 which showed that I'd missed some cases before.

Sure, still okay.

This revision was automatically updated to reflect the committed changes.
brad added a subscriber: brad.Wed, Oct 7, 11:40 PM

What about reverting this now that you have found this is an issue with the Sun linker?

ro added a comment.Thu, Oct 8, 1:44 AM

What about reverting this now that you have found this is an issue with the Sun linker?

Knowing the source of the bug doesn't make it go away ;-) As explained in Bug 47304; I'm still awaiting the release of a fixed linker. I do have a patch for HandleLLVMOptions.cmake that only applies the workaround when the linker is broken, but that needs to be adjusted for the right version number.