This is an archive of the discontinued LLVM Phabricator instance.

[ThinLTO] Re-order modules for optimal multi-threaded processing
ClosedPublic

Authored by aganea on Sep 19 2020, 8:31 AM.

Details

Summary

This is a reimplementation of D60495 but with Teresa's suggestion applied: https://reviews.llvm.org/D60495#1562871

I've tested a 3-stage compilation, the graph below shows linking of clang.exe with -flto=thin, /opt:lldltojobs=all, no LTO cache, and -DLLVM_INTEGRATED_CRT_ALLOC=d:\git\rpmalloc on stage 1 & 2 to work around Windows Heap's scaling issues on many-core machines. Test running on 36-core Xeon 6140.

Before (total run is 100 sec):

After patch (total run is 85 sec):

The remaining issue after the falloff in the graph is PassBuilder.cpp which takes a long time to opt+codegen. If that file was split into several .CPPs, I suppose the linking could complete in 70 sec.

Diff Detail

Event Timeline

aganea created this revision.Sep 19 2020, 8:31 AM
aganea requested review of this revision.Sep 19 2020, 8:31 AM
aganea edited the summary of this revision. (Show Details)
aganea edited the summary of this revision. (Show Details)Sep 19 2020, 8:38 AM

Thanks for doing this!

What should I do about that? Some tests could be fixed (ordering) but I am unsure about others.

I took a quick look and the ones I saw seemed to be due to disassembling the wrong temp file, which is based on the numbering provided by "Task". Presumably you want to pass in the original ordering here, so I guess RegularLTO.ParallelCodeGenParallelismLevel + IndexCount?

aganea updated this revision to Diff 293455.Sep 22 2020, 7:54 AM
aganea edited the summary of this revision. (Show Details)

Works as intended, thanks @tejohnson!
All tests pass. Please have another look.

tejohnson accepted this revision.Sep 22 2020, 8:03 AM

LGTM with comment suggestion.

llvm/include/llvm/LTO/LTO.h
94

Nit, it isn't actually producing a reordered container. How about something like "Produces a container ordering for optimal multi-threaded processing. Returns ordered indices to elements in the input array."

This revision is now accepted and ready to land.Sep 22 2020, 8:03 AM
aganea marked an inline comment as done.Sep 22 2020, 8:07 AM
aganea added inline comments.
llvm/include/llvm/LTO/LTO.h
94

Changed, thanks for suggestion!

aganea edited the summary of this revision. (Show Details)Sep 22 2020, 8:23 AM
This revision was landed with ongoing or failed builds.Sep 22 2020, 8:26 AM
This revision was automatically updated to reflect the committed changes.
aganea marked an inline comment as done.

We had to revert due to unexpected side effects on distributed ThinLTO. See the comments on the revert commit about the type of failures. Essentially, for distributed ThinLTO, the backends is of type WriteIndexesThinBackend. It runs serially, not in parallel via threads, and writes all the files needed by the distributed backend processes. One of the files it writes is a list of files to include in the final native link, which is provided to the final link process. The order modules are written into this file is the order they will be linked. With this patch, the order is changed and no longer matches the original link order, and that can produce unexpected failures. Probably the best fix is to simply skip the reordering when BackendProc is a WriteIndexesThinBackend. It won't help anyway given that it is a serial backend.

aganea updated this revision to Diff 297869.Oct 13 2020, 8:20 AM
aganea added a reviewer: rupprecht.

Check for WriteIndexesThinBackend before re-ordering modules.

aganea reopened this revision.Oct 13 2020, 8:21 AM
This revision is now accepted and ready to land.Oct 13 2020, 8:21 AM
aganea requested review of this revision.Oct 13 2020, 8:21 AM

It would be great to test that the ordering of the files in the object file list doesn't change with the write indexes backend. Unfortunately, currently llvm-lto2 doesn't have an option to trigger this output file, but you could use either the gold plugin or lld. There is an existing gold plugin test that does test this output file:
llvm/test/tools/gold/X86/thinlto_emit_linked_objects.ll

It looks like that test would have failed if the object files had been reordered, but it looks like the first one listed is bigger so due to luck we ended up with the same ordering with your original patch. You could probably make a new version of this test that doesn't use --start-lib/--end-lib but rather just passes both bitcode objects to gold normally, and reorder the order the two bitcode files are passed so the Input version which appears smaller is passed first, then confirm that you get the expected ordering in the object list file %t3 (which would be the order that they were passed to gold). Would be good to confirm that without the new guard against reordering with the write indexes backend the ordering gets switched, which is what we want to prevent.

llvm/lib/LTO/LTO.cpp
1468

The ParallelCodeGenParallelismLevel is unrelated to the number of threads we will be using for ThinLTO backends (rather it is related to regular LTO code generation). For the latter you would need to do something like call getThreadCount() on the BackendThreadPool on the InProcessThinBackend.

1469

Needs comment as to why we are not doing this for WriteIndexesKind

aganea updated this revision to Diff 297971.Oct 13 2020, 3:15 PM
aganea marked 2 inline comments as done.

Simplify & address comments.

aganea added a comment.EditedOct 13 2020, 3:15 PM

@tejohnson I added the test in LLD, since the gold tests only run on Linux, which is harder for me to test & debug. The test fails when the following block is removed: if (BackendProc->getThreadCount() == 1) { ... }.
This test also implictly covers the "InProcessThinBackend" codepath with /opt:lldltojobs=1 which I don't see how to cover explicitly otherwise.

llvm/lib/LTO/LTO.cpp
1468

Fixed by adding a virtual ThinBackendProc::getThreadCount() API.

1469

With the new ThinBackendProc::getThreadCount() things are a bit more explicit I think.

tejohnson accepted this revision.Oct 13 2020, 3:51 PM

@tejohnson I added the test in LLD, since the gold tests only run on Linux, which is harder for me to test & debug. The test fails when the following block is removed: if (BackendProc->getThreadCount() == 1) { ... }.

Great, thanks! LGTM with request for comment as described below.

This test also implictly covers the "InProcessThinBackend" codepath with /opt:lldltojobs=1 which I don't see how to cover explicitly otherwise.

I don't see how this test covers that code path since it isn't using in process backends, but I'm not really sure how to test this case effectively - like the original patch, it is just a performance change on the in process backends case.

llvm/lib/LTO/LTO.cpp
1469

The virtual method looks good and cleaner. But I'd like to capture in a comment somewhere that for write indexes thin backend there is a user visible effect if we were to reorder, in that the LinkedObjectsFile written for distributed ThinLTO will contain the objects in a different order, which will in turn affect the final link order assuming that file is used as intended. I.e. we don't just skip it in the thread = 1 case because it is unnecessary.

This revision is now accepted and ready to land.Oct 13 2020, 3:51 PM
This revision was landed with ongoing or failed builds.Oct 13 2020, 6:54 PM
This revision was automatically updated to reflect the committed changes.
aganea marked an inline comment as done.

@tejohnson I added the test in LLD, since the gold tests only run on Linux, which is harder for me to test & debug. The test fails when the following block is removed: if (BackendProc->getThreadCount() == 1) { ... }.

Great, thanks! LGTM with request for comment as described below.

This test also implictly covers the "InProcessThinBackend" codepath with /opt:lldltojobs=1 which I don't see how to cover explicitly otherwise.

I don't see how this test covers that code path since it isn't using in process backends, but I'm not really sure how to test this case effectively - like the original patch, it is just a performance change on the in process backends case.

I meant if we have:
; RUN: lld-link /lldsavetemps /out:%t4.exe /entry:main /subsystem:console %t2.obj %t1.obj /opt:lldltojobs=1
We end up in if (BackendProc->getThreadCount() == 1) as if we did:
RUN: lld-link -thinlto-index-only:%t3 /entry:main %t1.obj %t2.obj

llvm/lib/LTO/LTO.cpp
1469

I've added more relevant comments, as suggested.

@tejohnson I added the test in LLD, since the gold tests only run on Linux, which is harder for me to test & debug. The test fails when the following block is removed: if (BackendProc->getThreadCount() == 1) { ... }.

Great, thanks! LGTM with request for comment as described below.

This test also implictly covers the "InProcessThinBackend" codepath with /opt:lldltojobs=1 which I don't see how to cover explicitly otherwise.

I don't see how this test covers that code path since it isn't using in process backends, but I'm not really sure how to test this case effectively - like the original patch, it is just a performance change on the in process backends case.

I meant if we have:
; RUN: lld-link /lldsavetemps /out:%t4.exe /entry:main /subsystem:console %t2.obj %t1.obj /opt:lldltojobs=1
We end up in if (BackendProc->getThreadCount() == 1) as if we did:
RUN: lld-link -thinlto-index-only:%t3 /entry:main %t1.obj %t2.obj

Oh, I see what you are saying. Makes sense. New comment looks good, thanks.