
[llvm-lto2] By default, use two threads for ThinLTO backend.
Needs Review · Public

Authored by ychen on Jan 19 2020, 5:50 PM.

Details

Summary

Some bots are reporting Resource temporarily unavailable because of D67847.
http://lab.llvm.org:8011/builders/clang-ppc64le-rhel/builds/592/steps/ninja%20check%201/logs/stdio

It is likely that too many threads are spawned, since lit -j and
ThinLTO backend parallelism are in effect at the same time. (One stack per thread can bloat memory quickly.)

terminate called after throwing an instance of 'std::system_error'

what():  Resource temporarily unavailable

Stack dump:
0. Program arguments: /home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/bin/llvm-lto2 /home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/bin/llvm-lto2 -O0 -save-temps /home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/test/ThinLTO/X86/Output/index-const-prop-O0.ll.tmp2.bc -r=/home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/test/ThinLTO/X86/Output/index-const-prop-O0.ll.tmp2.bc,g,pl /home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/test/ThinLTO/X86/Output/index-const-prop-O0.ll.tmp1.bc -r=/home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/test/ThinLTO/X86/Output/index-const-prop-O0.ll.tmp1.bc,main,plx -r=/home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/test/ThinLTO/X86/Output/index-const-prop-O0.ll.tmp1.bc,g, -o /home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/test/ThinLTO/X86/Output/index-const-prop-O0.ll.tmp3
#0 0x00007357b43b1a34 PrintStackTraceSignalHandler(void*) (/home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/bin/../lib/libLLVMSupport.so.11git+0x1f1a34)
#1 0x00007357b43ae7a8 llvm::sys::RunSignalHandlers() (/home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/bin/../lib/libLLVMSupport.so.11git+0x1ee7a8)
#2 0x00007357b43b1ea4 SignalHandler(int) (/home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/bin/../lib/libLLVMSupport.so.11git+0x1f1ea4)
#3 0x00007357b8ac04d8 0x4d8 abort
#4 0x00007357b8ac04d8 /usr/src/debug/glibc-2.17-c758a686/stdlib/abort.c:75:0
#5 0x00007357b8ac04d8 __gnu_cxx::__verbose_terminate_handler() (+0x4d8)
#6 0x00007357b3cd20b4 (/lib64/libc.so.6+0x420b4)
#7 0x00007357b403eda4 std::terminate() (/lib64/libstdc++.so.6+0x8eda4)
#8 0x00007357b403b5d4 __cxa_throw (/lib64/libstdc++.so.6+0x8b5d4)
#9 0x00007357b403b624 std::__throw_system_error(int) (/lib64/libstdc++.so.6+0x8b624)
#10 0x00007357b403baa8 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib64/libstdc++.so.6+0x8baa8)
#11 0x00007357b43b5290 llvm::ThreadPool::ThreadPool(unsigned int) (/home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/bin/../lib/libLLVMSupport.so.11git+0x1f5290)
#12 0x00007357b43b77b8 std::_Function_handler<std::unique_ptr<llvm::lto::ThinBackendProc, std::default_delete<llvm::lto::ThinBackendProc> > (llvm::lto::Config const&, llvm::ModuleSummaryIndex&, llvm::StringMap<llvm::DenseMap<unsigned long, llvm::GlobalValueSummary*, llvm::DenseMapInfo<unsigned long>, llvm::detail::DenseMapPair<unsigned long, llvm::GlobalValueSummary*> >, llvm::MallocAllocator>&, std::function<std::unique_ptr<llvm::lto::NativeObjectStream, std::default_delete<llvm::lto::NativeObjectStream> > (unsigned int)>, std::function<std::function<std::unique_ptr<llvm::lto::NativeObjectStream, std::default_delete<llvm::lto::NativeObjectStream> > (unsigned int)> (unsigned int, llvm::StringRef)>), llvm::lto::createInProcessThinBackend(unsigned int)::$_9>::_M_invoke(std::_Any_data const&, llvm::lto::Config const&, llvm::ModuleSummaryIndex&, llvm::StringMap<llvm::DenseMap<unsigned long, llvm::GlobalValueSummary*, llvm::DenseMapInfo<unsigned long>, llvm::detail::DenseMapPair<unsigned long, llvm::GlobalValueSummary*> >, llvm::MallocAllocator>&, std::function<std::unique_ptr<llvm::lto::NativeObjectStream, std::default_delete<llvm::lto::NativeObjectStream> > (unsigned int)>&&, std::function<std::function<std::unique_ptr<llvm::lto::NativeObjectStream, std::default_delete<llvm::lto::NativeObjectStream> > (unsigned int)> (unsigned int, llvm::StringRef)>&&) (/home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/bin/../lib/libLLVMSupport.so.11git+0x1f77b8)
#13 0x00007357b433da34 llvm::lto::LTO::runThinLTO(std::function<std::unique_ptr<llvm::lto::NativeObjectStream, std::default_delete<llvm::lto::NativeObjectStream> > (unsigned int)>, std::function<std::function<std::unique_ptr<llvm::lto::NativeObjectStream, std::default_delete<llvm::lto::NativeObjectStream> > (unsigned int)> (unsigned int, llvm::StringRef)>, llvm::DenseSet<unsigned long, llvm::DenseMapInfo<unsigned long> > const&) (/home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/bin/../lib/libLLVMSupport.so.11git+0x17da34)
#14 0x00007357b487c124 llvm::lto::LTO::run(std::function<std::unique_ptr<llvm::lto::NativeObjectStream, std::default_delete<llvm::lto::NativeObjectStream> > (unsigned int)>, std::function<std::function<std::unique_ptr<llvm::lto::NativeObjectStream, std::default_delete<llvm::lto::NativeObjectStream> > (unsigned int)> (unsigned int, llvm::StringRef)>) (/home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/bin/../lib/libLLVMLTO.so.11git+0x2c124)
#15 0x00007357b4874354 run(int, char**) (/home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/bin/../lib/libLLVMLTO.so.11git+0x24354)
#16 0x00007357b4873150 main (/home/docker/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/stage1/bin/../lib/libLLVMLTO.so.11git+0x23150)
#17 0x00000000100127cc generic_start_main.isra.0 /usr/src/debug/glibc-2.17-c758a686/csu/../csu/libc-start.c:266:0
#18 0x000000001000f6dc __libc_start_main /usr/src/debug/glibc-2.17-c758a686/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:81:0
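
The trace above shows std::thread::_M_start_thread, called from the llvm::ThreadPool constructor, throwing std::system_error; nothing catches it, so std::terminate aborts the process. The following is a minimal standalone sketch of that failure mode, not a proposed fix: spawn_up_to is a hypothetical helper, and LLVM itself is typically built with exceptions disabled, so ThreadPool could not handle the error this way.

```cpp
#include <cstddef>
#include <system_error>
#include <thread>
#include <vector>

// Hypothetical helper (not part of LLVM): try to start `wanted` worker
// threads and report how many actually started. std::thread's constructor
// throws std::system_error when the thread cannot be created, for example
// with EAGAIN ("Resource temporarily unavailable").
std::size_t spawn_up_to(std::size_t wanted) {
  std::vector<std::thread> workers;
  workers.reserve(wanted); // no reallocation while threads are being added
  try {
    for (std::size_t i = 0; i < wanted; ++i)
      workers.emplace_back([] { /* stand-in for a backend task */ });
  } catch (const std::system_error &) {
    // Stop spawning; the threads created so far are still joinable.
  }
  const std::size_t started = workers.size();
  for (std::thread &t : workers)
    t.join();
  return started;
}
```

On a healthy machine spawn_up_to(4) starts all four threads; under resource pressure it would return a smaller count instead of terminating.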


Event Timeline

ychen created this revision. Jan 19 2020, 5:50 PM

Unit tests: pass. 61907 tests passed, 0 failed and 782 were skipped.

clang-tidy: unknown.

clang-format: pass.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Do you know why it did not fail with what(): Resource temporarily unavailable before you committed D67847 (which has been reverted)?

I don't really understand how D67847 is creating the problem?

ychen added a comment. Edited Jan 19 2020, 10:59 PM

@MaskRay @mehdi_amini I have no clue either as to why this did not happen before. After adding std::__throw_system_error in the ThreadPool ctor and reverting D67847, I still see the crash on my machine (Ubuntu 18.04). I'm inclined to think the exception was simply not triggered before, rather than triggered but somehow suppressed.

FYI, besides the problem this change wanted to address, D67847 also triggered a random host compiler kill on PPC64be bots.
http://lab.llvm.org:8011/builders/clang-ppc64be-linux/builds/43016/steps/build%20stage%201/logs/stdio

Both issues are related to system resource constraints, but how they are related to D67847 remains a mystery to me. The PPC64 bot admins said they are willing to help diagnose these issues.

Do you think this change is worthwhile in general? I don't think any tests rely on the number of backend threads, and using fewer resources should be beneficial.

Honestly, this looks weird (why 2 threads and not 1, for example?). Also, for anyone running LTO on a multicore machine, the results would be surprising, to say the least.

What's the value of cat /proc/sys/kernel/threads-max on the build bot? On my machine it is ~84K, which seems enough even for lit/lto being run on 64 cores.
Also, where exactly are you getting EAGAIN (syscall, library call)? You can try getting libstdc++.so.6 from the build bot and disassembling it at the reported offset.
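
For checking the limit programmatically rather than with cat, a small sketch follows. The helper names are hypothetical; it only reads the system-wide value and says nothing about per-user limits or memory, either of which may also make thread creation fail with EAGAIN.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: read a single integer from a file such as
// /proc/sys/kernel/threads-max. Returns -1 if the file cannot be opened
// or does not start with a number (for example on a non-Linux host).
long read_long_from_file(const std::string &path) {
  std::ifstream in(path);
  long value = -1;
  if (!(in >> value))
    return -1;
  return value;
}

// System-wide thread limit on Linux, or -1 if it is unavailable.
long kernel_threads_max() {
  return read_long_from_file("/proc/sys/kernel/threads-max");
}
```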

I have tended to ignore failures on clang-ppc64be-linux in the past... In my experience it was quite unstable and often failed with c++: internal compiler error: Killed (program cc1plus). I am not sure whether it has improved. If it is the only failing bot, I would tend to ignore it. Maybe we can ask the admin of the build bot.

ychen edited the summary of this revision. (Show Details) Jan 20 2020, 9:02 AM
ychen added a comment. Jan 20 2020, 9:18 AM

> Honestly, this looks weird (why 2 threads and not 1, for example?).

A single thread feels like a special case for multithreaded code. I've seen code with separate paths for 1 and > 1 threads, although I'm not aware of any such code in ThinLTO.

> Also, for anyone running LTO on a multicore machine, the results would be surprising, to say the least.

Could you please elaborate a bit on this? Correct me if I'm wrong, but my understanding is that this does not change results, and llvm-lto2 is usually used only for testing. It does not seem sensible to me to activate all cores by default to run the backend, since in the common case there are only a few compile units (and with Docker, each container sees all of the host's cores). So the worst case would be <num of containers> * <lit -j> * <num of cores> threads.
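
As a back-of-the-envelope illustration of that worst case, here is a small sketch. The function names and the sample numbers in the note after it are hypothetical and not measured on any bot; the stack estimate covers reserved address space only, which is not necessarily resident memory.

```cpp
#include <cstdint>

// Worst-case number of backend threads alive at once, assuming every lit
// job in every container runs llvm-lto2 at the same moment and each run
// creates `threads_per_run` backend threads.
constexpr std::uint64_t worst_case_threads(std::uint64_t containers,
                                           std::uint64_t lit_jobs,
                                           std::uint64_t threads_per_run) {
  return containers * lit_jobs * threads_per_run;
}

// Address space reserved for thread stacks, given a per-thread stack size.
constexpr std::uint64_t reserved_stack_bytes(std::uint64_t threads,
                                             std::uint64_t stack_bytes) {
  return threads * stack_bytes;
}
```

For example, with 4 containers, lit -j 32, and 32 cores per container, worst_case_threads(4, 32, 32) is 4096 threads; with two backend threads per run it drops to 256. Assuming an 8 MiB stack per thread (a common Linux default, not verified for these bots), 4096 threads would reserve 32 GiB of stack address space.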

> What's the value of cat /proc/sys/kernel/threads-max on the build bot? On my machine it is ~84K, which seems enough even for lit/lto being run on 64 cores.

Sorry, I didn't make myself clear. Memory saturates first because of the high number of threads. /proc/sys/kernel/threads-max should be large enough usually.

> Also, where exactly are you getting EAGAIN (syscall, library call)? You can try getting libstdc++.so.6 from the build bot and disassembling it at the reported offset.

From the stack trace, libstdc++.so throws it. Thanks for the suggestion. I think it is a great way to diagnose the issue.

ychen added a comment. Jan 20 2020, 9:24 AM

> I have tended to ignore failures on clang-ppc64be-linux in the past... In my experience it was quite unstable and often failed with c++: internal compiler error: Killed (program cc1plus). I am not sure whether it has improved. If it is the only failing bot, I would tend to ignore it. Maybe we can ask the admin of the build bot.

I just got an email from them: they think memory was being exhausted, and they have dialed down the build parallelism. Since D67847 changes only a cpp file, there were probably too many link jobs running at the same time.