This is an archive of the discontinued LLVM Phabricator instance.

LTO: Add support for multi-module bitcode files.
ClosedPublic

Authored by pcc on Dec 1 2016, 11:36 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

pcc updated this revision to Diff 79959.Dec 1 2016, 11:36 AM
pcc retitled this revision from to LTO: Add support for multi-module bitcode files..
pcc updated this object.
pcc added a reviewer: mehdi_amini.
pcc added a subscriber: llvm-commits.
mehdi_amini added inline comments.Dec 9 2016, 8:36 PM
clang/lib/CodeGen/BackendUtil.cpp
771 ↗(On Diff #79959)

The error message could be a bit more specific here.

775 ↗(On Diff #79959)

We load the first BitcodeModule, should we assert that this is the only one? How do we know this is the right one?

llvm/include/llvm/LTO/LTO.h
224 ↗(On Diff #79959)

Since we have the two, it seems worth documenting these.

421 ↗(On Diff #79959)

This is a pretty annoying API with ResI as an InOut parameter: at least document it.

llvm/lib/LTO/LTO.cpp
187 ↗(On Diff #79959)

Document.

190 ↗(On Diff #79959)

Add a comment to explain why it is out-of-line.

200 ↗(On Diff #79959)

Some high level one or two sentences commenting what is going to happen below in the loop would be welcome I think.

212 ↗(On Diff #79959)

Is there an expectation that no symbol can be duplicated in the various modules?

250 ↗(On Diff #79959)

You can create an InputFile where we wouldn't find any module, if you don't want to support that we should detect it in the create() and return an error there. Otherwise this is UB to be called on a valid created InputFile

276 ↗(On Diff #79959)

Add a comment to explain why it is out-of-line.

456 ↗(On Diff #79959)

We'll arrive here from for (InputFile::InputModule &IM : Input->Mods) in LTO::add(), how are we distinguishing BM.getModuleIdentifier() when there are multiple modules per file?
I assume we don't expect multiple modules with a summary in the same input file, but in this case we should check here that the module has been inserted and error otherwise.

pcc marked 10 inline comments as done.Dec 13 2016, 4:03 PM
pcc added inline comments.
clang/lib/CodeGen/BackendUtil.cpp
775 ↗(On Diff #79959)

I added some code that makes sure that we load the right module.

llvm/lib/LTO/LTO.cpp
212 ↗(On Diff #79959)

If you mean that the modules may not both contain a definition of the symbol, then yes. The entity creating these bitcode files is expected to adhere to this policy. (The same applies to definitions of comdats, so the code on lines 215-221 below remains the same.)

It's possible to have a defined symbol in one module and a duplicate undefined symbol with the same name in another module; the situation is similar to module inline asm right now (i.e. PR30396). Whatever solution we come up with for that problem should also apply to this one.

250 ↗(On Diff #79959)

Added a check to create().

456 ↗(On Diff #79959)

BM.getModuleIdentifier() will be the same for each BitcodeModule in the same input file by construction, so we can just test the return value of insert here.

(I feel obliged to point out that this is yet another scenario where (path, byte slice) pairs would work better as a representation of ThinLTO modules.)

pcc updated this revision to Diff 81322.Dec 13 2016, 4:03 PM
pcc marked 3 inline comments as done.
  • Address review comments
mehdi_amini added inline comments.Dec 13 2016, 4:49 PM
clang/lib/CodeGen/BackendUtil.cpp
783 ↗(On Diff #81322)

What if I.first was already present in the map? It seems the same case as below for ThinLTO where we should check that the returned value of insert().second .

llvm/lib/LTO/LTO.cpp
456 ↗(On Diff #79959)

What is the other scenario? I don't remember what was specific to ThinLTO for this?

Also a pair path/byte slice doesn't fit the API in general, does it? We don't necessarily have a "file".
That said, instead of path the identifier could be a <BufferID, offset>, with BufferID intended to be a unique identifier provided by the client of the API per-buffer. I'm not sure how deep we'd have to thread this though. I think that as long as when we create a llvm::Module the identifier is consistent with whatever is used here, everything should be OK.

mehdi_amini accepted this revision.Dec 13 2016, 4:49 PM
mehdi_amini edited edge metadata.

LGTM otherwise.

This revision is now accepted and ready to land.Dec 13 2016, 4:49 PM
pcc added inline comments.Dec 13 2016, 5:16 PM
clang/lib/CodeGen/BackendUtil.cpp
783 ↗(On Diff #81322)

That doesn't seem like it can happen. The range for loop starting on line 758 enumerates elements of a map, so each I.first() will be unique. We also break out of this loop as soon as we see a module with a summary.

llvm/lib/LTO/LTO.cpp
456 ↗(On Diff #79959)

What is the other scenario? I don't remember what was specific to ThinLTO for this?

The main other scenario is (non-thin) archive files, which is already a problem as we don't handle that correctly in the distributed case, and we have workarounds in both the gold plugin and lld to give -save-temps temporaries appropriate names. This is ThinLTO-specific because we don't create temporaries for each input file under regular LTO.

Also a pair path/byte slice doesn't fit the API in general, does it? We don't necessarily have a "file".

True, but it's also the case that we don't necessarily have a "file" now and we're already passing paths around. I imagine that clients which don't use files could just make something up for the offsets as they would make something up for file paths.

That said, instead of path the identifier could be a <BufferID, offset>, with BufferID intended to be a unique identifier provided by the client of the API per-buffer.

I think that could work as well.

This revision was automatically updated to reflect the committed changes.
mehdi_amini added inline comments.Dec 13 2016, 5:37 PM
llvm/lib/LTO/LTO.cpp
456 ↗(On Diff #79959)

In the C API (and thus in ld64), the expectation is that every input buffer has to be provided with a unique id. It happens that this ID is usually the path on file (with libFoo.a(member.o) when a static archive is involved).
Because the linker is supplying one member at a time, the unique ID in the API is enough.

Having multiple modules per file is new though and would break this "unique ID" provided by the linker, we could suffix it when loading each individual bitcode/module though.

The --save-temps is always using integer right now I believe (files are 1.thin.o, 2.thin.o, etc.).

True, but it's also the case that we don't necessarily have a "file" now and we're already passing paths around.

That's just us being inconsistent, I tried to express "ModuleID" instead of path as much as possible.

pcc added a subscriber: davide.Dec 13 2016, 5:51 PM
pcc added inline comments.
llvm/lib/LTO/LTO.cpp
456 ↗(On Diff #79959)

Because the linker is supplying one member at a time, the unique ID in the API is enough.

I think @davide had a case where an archive had two members with the same name, and they were only distinguishable by offset.

The --save-temps is always using integer right now I believe (files are 1.thin.o, 2.thin.o, etc.).

Not always: http://llvm-cs.pcc.me.uk/lib/LTO/LTOBackend.cpp#77
gold and lld both pass UseInputModulePath == true here.

We probably want to overhaul how the filenames look for save-temps, but I think in the end they should contain the module ID in some form.

Yes, thin-archivecollision.ll in lld/test/lto is an example, but it could be even worse because the two members can be part of the same archive. I find it weird that's possible to create an archive with two distinct members and the same name, but apparently ar consider this a legitimate operation.

mehdi_amini added inline comments.Dec 13 2016, 6:00 PM
llvm/lib/LTO/LTO.cpp
456 ↗(On Diff #79959)

I think @davide had a case where an archive had two members with the same name, and they were only distinguishable by offset

I remember this, I wouldn't be too much concerned *here* with this and push to the client the responsibility to provide unique IDs. (Nothing prevent the linker to build an id that always include the offset for instance)

pcc added inline comments.Dec 13 2016, 6:16 PM
llvm/lib/LTO/LTO.cpp
456 ↗(On Diff #79959)

Well, in the distributed case some component needs to understand the module IDs in order to distribute the work properly. Part of that may involve translating the module IDs into "file paths" (which may be actual file paths or conceptual ones). If we can arrange to use the same ("file path", offset) scheme in all linkers, that component can be shared between linkers.

But to a certain extent this is all hypothetical, I think I'd want to prototype before being sure that this is the right design.