This is an archive of the discontinued LLVM Phabricator instance.

[COFF] add /errorrepro to save reproducer on error
AbandonedPublic

Authored by inglorion on Jun 5 2018, 2:25 PM.

Details

Summary

This adds an /errorrepro:<dir> flag to lld-link. The flag
instructs the linker to write a repro.tar file in the specified
directory if linking fails. The functionality is modeled after
/linkrepro, which unconditionally writes the reproducer.

Event Timeline

inglorion created this revision.Jun 5 2018, 2:25 PM
ruiu added a comment.Jun 5 2018, 2:32 PM

What was the motivation for this new option? Given that it is easy to write run the same command again with /linkrepro, I don't see an immediate need to add this as a new option.

I am investigating https://bugs.chromium.org/p/chromium/issues/detail?id=786127, which is a failure that does not reproduce reliably. I could use /linkrepro to save all link inputs always, but that would use up a lot of disk space. Running the command again with /linkrepro will almost certainly not reproduce the failure. Having a way to only save the reproducer when an error occurs would help here, and also seems a generally useful feature to have.

ruiu added inline comments.Jun 6 2018, 11:14 AM
lld/COFF/Driver.cpp
58

Is this really called when lld exits? I think by default lld calls _exit() to avoid the cost of destruction, so dtors are not called on exit.

pcc added a subscriber: pcc.Jun 6 2018, 11:29 AM

If the test failure is related to LTO, would this really help? As I understand it, the failures are on the object files created by LTO, which I don't think would appear in the linkrepro file.

I wrote a small patch:

diff --git a/lld/COFF/Chunks.cpp b/lld/COFF/Chunks.cpp
index 6d8a0c7c1f4..2931bc23d21 100644
--- a/lld/COFF/Chunks.cpp
+++ b/lld/COFF/Chunks.cpp
@@ -97,9 +97,11 @@ void SectionChunk::applyRelX64(uint8_t *Off, uint16_t Type, OutputSection *OS,
   case IMAGE_REL_AMD64_REL32_5:  add32(Off, S - P - 9); break;
   case IMAGE_REL_AMD64_SECTION:  applySecIdx(Off, OS); break;
   case IMAGE_REL_AMD64_SECREL:   applySecRel(this, Off, OS, S); break;
-  default:
+  default: {
+    llvm::outs() << toHex(File->MB.getBuffer()) << '\n';
     fatal("unsupported relocation type 0x" + Twine::utohexstr(Type));
   }
+  }
 }

 void SectionChunk::applyRelX86(uint8_t *Off, uint16_t Type, OutputSection *OS,

and used it to build chromium in a loop on my machine overnight. With that I was able to reproduce the problem and capture one of the object files. I'll take a look at it.

Thanks, @pcc . Out of curiosity, how did you build chrome (I guess I'm mostly interested in args.gn and the command you used to build). I've been trying this on my machine and it can literally go for weeks without hitting the problem.

inglorion added inline comments.Jun 6 2018, 11:45 AM
lld/COFF/Driver.cpp
58

Good point. In an earlier version I had implemented hooks to be run by exitLld() so that I could write the error repro there. This version was an attempt to simplify that, based on the idea that the destructor for ReproWriter is run before control leaves LinkerDriver::link(). That works fine for some cases, but not cases where we exceed the error limit and ExitEarly is true, or we call fatal(). Let me see if I can come up with a better way.

pcc added a comment.Jun 6 2018, 12:27 PM

Here is the args.gn I was using. It's basically the same as the bot:

is_clang = true
is_component_build = false
is_debug = false
is_official_build = true
symbol_level = 2
use_lld = true
use_thin_lto = true

I also had a local patch to chrome to get it to use my hacked lld:

diff --git a/build/toolchain/win/BUILD.gn b/build/toolchain/win/BUILD.gn
index 4d9d1f45f870..8e4a30b223ba 100644
--- a/build/toolchain/win/BUILD.gn
+++ b/build/toolchain/win/BUILD.gn
@@ -109,7 +109,7 @@ template("msvc_toolchain") {
       # lld-link includes a replacement for lib.exe that can produce thin
       # archives and understands bitcode (for lto builds).
       lib = "$prefix/$lld_link /lib /llvmlibthin"
-      link = "$prefix/$lld_link"
+      link = "C:/src/l/ra/bin/lld-link.exe"
       if (host_os != "win") {
         # See comment adding --rsp-quoting to $cl above for more information.
         link = "$link --rsp-quoting=posix"

I built it with this batch file:

for /l %%x in (1, 1, 100) do (
  mkdir llvm_exp%%x
  cd llvm_exp%%x
  copy ..\args.gn .
  gn gen .
  ninja all > ninja.out
  cd ..
)
inglorion added inline comments.Jun 12 2018, 3:26 PM
lld/COFF/Driver.cpp
58

I think we are guaranteed that we will either leave the scope of LinkerDriver::link(), or we will call exitLld() (there is another case where we die from a signal, but that's not a case I'm looking to address here; we can address that in a later patch if we want to).

This patch as it is already addresses the case of going out of scope, so we need to address the exitLld() case. exitLld() already calls llvm_shutdown(), so we could use ManagedStatic...but I prefer to actually add a way to register hooks for exitLld().

I'll go ahead and add that, unless someone objects.

inglorion updated this revision to Diff 151055.Jun 12 2018, 3:51 PM

Ensure reproducer is also written when calling exitLld().

inglorion updated this revision to Diff 151056.Jun 12 2018, 3:52 PM

Re-upload.

inglorion marked an inline comment as done.Jun 12 2018, 3:52 PM
pcc added a comment.Jun 12 2018, 4:03 PM

I still don't see a compelling reason to do this to be honest. The use case we had in mind was to help debug an intermittent failure, but as we discovered it wouldn't have helped debug that failure. Also, if a failure is intermittent then it can be reproduced just as much with /errorrepro as with /linkrepro.

I still don't see a compelling reason to do this to be honest.

Fair enough, and if the consensus is that we'd rather not have this, I'll withdraw it. You've reduced the problem that originally prompted me to write this (and you're right it doesn't capture the object files LTO generates). However, aside from that specific problem, I still think that something like this would be useful to diagnose linker errors we may encounter in the future, particularly ones that don't reproduce reliably. As I pointed out earlier, /linkrepro isn't really what I would want for that because you either need to re-run the link until you hit the same failure again, which could be very time consuming, or pre-emptively run with /linkrepro, which slows down links and writes lots of bytes you usually don't care about.

pcc added a comment.Jun 12 2018, 5:50 PM

As I pointed out earlier, /linkrepro isn't really what I would want for that because you either need to re-run the link until you hit the same failure again, which could be very time consuming, or pre-emptively run with /linkrepro, which slows down links and writes lots of bytes you usually don't care about.

You could always rerun the link with /linkrepro if the link fails. Even if the second link succeeds, you still have a reproducer that you can try repeatedly until you hit the failure.

inglorion abandoned this revision.Jun 13 2018, 2:50 PM

After checking with ruiu, it seems we are all on the side of not adding this. Abandoning.