This is a follow-up for D70378 (Cover usage of LLD as a library).
I'm increasingly convinced that it is impossible to support re-entrance once a failure occurs in LLD, even if that currently works on the surface.
While debugging an intermittent failure on a bot, I recalled this scenario which causes the issue:
- When executing lld/test/ELF/invalid/symtab-sh-info.s L45, we reach lld::elf::ObjFile::ObjFile() which goes straight into its base ELFFileBase(), then ELFFileBase::init().
- At that point fatal() is thrown in lld/ELF/InputFiles.cpp L381, leaving a half-initialized ObjFile instance.
- We end up in lld::exitLld() and since we are running with LLD_IN_TEST, we hapily restore the control flow to CrashRecoveryContext::RunSafely() then back in lld::safeLldMain().
- Before this patch, we called errorHandler().reset() just after, and this attempted to reset the associated SpecificAlloc<ObjFile<ELF64LE>>. That tried to free the half-initialized ObjFile instance, and more precisely its ObjFile::dwarf member.
Sometimes that worked, sometimes it failed and was catched by the CrashRecoveryContext. This scenario was the reason we called errorHandler().reset() through a CrashRecoveryContext.
But in some rare cases, the above repro somehow corrupted the heap, creating a stack overflow. When the CrashRecoveryContext's filter (that is, __except (ExceptionFilter(GetExceptionInformation()))) tried to handle the exception, it crashed again since the stack was exhausted -- and that took the whole application down. That is the issue seen on the bot. Locally it happens about 1 times out of 15.
Now this situation can happen anywhere in LLD. Since catching stack overflows is not a reliable scenario when using CrashRecoveryContext, we're now preventing further re-entrance when such failures occur, by signaling lld::SafeReturn::canRunAgain=false. When running with LLD_IN_TEST=2, only one iteration will be executed in this case.
Nit: newer functions should stick with the naming rule. For Exit, you can keep it capitalized to reduce disruption since you just add a default argument which does not require all call sites to change.