This is an archive of the discontinued LLVM Phabricator instance.

Add Statically Linked Libraries
ClosedPublic

Authored by ashi1 on Apr 23 2020, 2:05 PM.

Details

Summary

Add GNU Static Lib Tool, which supports the --emit-static-lib
flag. For HIP, a static library archive will be created and
consist of HIP Fat Binary host object with the device images embedded.
Using MC Directives, embed the device images and define symbols.
Using llvm-ar to create the static archive. Also, delete existing
output file to ensure a new archive is created each time.

Diff Detail

Event Timeline

ashi1 created this revision.Apr 23 2020, 2:05 PM
ashi1 updated this revision to Diff 259712.Apr 23 2020, 2:16 PM

Adding a new diff with full context.

clang/lib/Driver/ToolChain.cpp
579

What does ar do with the symbol table for this? I'd expect it to include host symbols but none of the device ones

tra added inline comments.Apr 23 2020, 4:05 PM
clang/lib/Driver/ToolChain.cpp
579

Also, can we use llvm-ar? It should reduce the amount of surprises we may run into with whatever ar we may find in the PATH.

ashi1 marked 3 inline comments as done.Apr 24 2020, 12:24 PM
ashi1 added inline comments.
clang/lib/Driver/ToolChain.cpp
579

Hi Jon, the ar tool will embed the host objects, .o, and the device fat binary object, .hipfbo, into a global symbol __hip_fatbin. HIP runtime has the capability to extract this symbol, and digest the HIP fat binary, and extract the device symbols to run.

Hi tra, I will update this to use llvm-ar, thanks.

ashi1 updated this revision to Diff 259969.Apr 24 2020, 12:53 PM

Changed to use llvm-ar tool.

yaxunl accepted this revision.Apr 30 2020, 1:13 PM

LGTM. Thanks.

clang/lib/Driver/ToolChains/CommonArgs.cpp
1427

should be HIPToolChain

1475

extra blank line

This revision is now accepted and ready to land.Apr 30 2020, 1:13 PM
tra added a comment.Apr 30 2020, 2:55 PM

Added GNU Static Lib Tool, which supports the --hip-emit-static-lib flag with -fgpu-rdc. A static library archive will be created and include both the host object and device object embedded.

What's going to be in that static library in the end? The description above is somewhat confusing. On one hand it says include both the host object and device object embedded.. Does it mean host object and device object to be embeddeed? I.e. two separate object files? Or does it mean that the device object is embedded in the host object and the host object is included in the library?

Using llvm-mc to embed the HIP fat binary into __hip_fatbin symbol, and system ar to archive. Also, delete existing output file to ensure a new archive with index is created each time.

__hip_fatbin, presumably in the host object. If the device object is embedded in the host object, why do we need to put it into the library? In the end theuser would only see the host object and we may just link it directly without making it a library. What do I miss?

It would be great to add some details to the test, because right now it's really hard to figure out what's being done and almost impossible to understand why it is done that way.

The test example is also odd in a sense that it compiles two source files. Is the purpose of this patch to allow creation of a self-contained archive with GPU-side objects that still need final linking step further down the build pipeline?

If that's indeed the driving motivation behind the patch, it may need some improvement, as it's unlikely to work well in practice, where pretty much every build system compiles one file at a time. Giving single compilation large number of inputs is simply not going to scale in practice. We need to figure out a way to implement the final link for object files compiled with -fgpu-rdc in separate compilations.

The purpose of this patch is to generate a static library that can be used as an "ordinary" static library, i.e., can be linked by a system linker, given a bunch of HIP programs that need -fgpu-rdc to compile.

This is a common use case for a library written with HIP. That is, the library provides a bunch of host APIs so that other C++ programs can use it. The library does not provide device code that need to be linked with the user's device code. The users want to use it as an ordinary host library.

If the users of the library just need dynamic library, then they do not need this feature, since clang is already able to do that.

However, clang is not able to generate a static library for that purpose. This is what this feature is trying to achieve.

So basically this patch will link the device code in these HIP programs, generate device ISA, and create a host object containing the ISA, and add the host object to the generated static library. This is exactly what clang does for -fgpu-rdc when generating a shared library. The only difference is that, instead of generating a shared library, now clang generates a static library.

tra added a comment.Apr 30 2020, 5:03 PM

The purpose of this patch is to generate a static library that can be used as an "ordinary" static library, i.e., can be linked by a system linker, given a bunch of HIP programs that need -fgpu-rdc to compile.

This is a common use case for a library written with HIP. That is, the library provides a bunch of host APIs so that other C++ programs can use it. The library does not provide device code that need to be linked with the user's device code. The users want to use it as an ordinary host library.

Is the intention that all GPU-side code inside this library is linked together, even though individual sources are compiled to object files?

Is there a way to compile .hip ->.o with GPU-side object and then
link together GPU objects from multiple such .o files into a single .o with partially linked host object and fully linked GPU code?

If the users of the library just need dynamic library, then they do not need this feature, since clang is already able to do that.

Can you give me an example how this shared library is produced?

However, clang is not able to generate a static library for that purpose. This is what this feature is trying to achieve.

So basically this patch will link the device code in these HIP programs, generate device ISA, and create a host object containing the ISA, and add the host object to the generated static library. This is exactly what clang does for -fgpu-rdc when generating a shared library. The only difference is that, instead of generating a shared library, now clang generates a static library.

Can you give me a small example of using clang to generate such .so ? If this approach relies on compiling multiple *source* files at once, that will have practicall use issues. If it can be done on .o as inputs, then it may be OK.

ashi1 added a comment.May 12 2020, 7:35 AM
In D78759#2014185, @tra wrote:

The purpose of this patch is to generate a static library that can be used as an "ordinary" static library, i.e., can be linked by a system linker, given a bunch of HIP programs that need -fgpu-rdc to compile.

This is a common use case for a library written with HIP. That is, the library provides a bunch of host APIs so that other C++ programs can use it. The library does not provide device code that need to be linked with the user's device code. The users want to use it as an ordinary host library.

Is the intention that all GPU-side code inside this library is linked together, even though individual sources are compiled to object files?

Is there a way to compile .hip ->.o with GPU-side object and then
link together GPU objects from multiple such .o files into a single .o with partially linked host object and fully linked GPU code?

When clang detects the .o files are clang-offload-bundle or host object, it will extract the bitcode and link them and generate ISA, then create a host object containing it, then link them together. We can link multiple objects together to create either a shared library or a static library (with this patch).

If the users of the library just need dynamic library, then they do not need this feature, since clang is already able to do that.

Can you give me an example how this shared library is produced?

In a ROCm project, RCCL is building the shared library using -fgpu-rdc. That project compiles sources using -c and -fgpu-rdc, and then links them together into a shared library. We require this patch to allow them to produce a static library too.

However, clang is not able to generate a static library for that purpose. This is what this feature is trying to achieve.

So basically this patch will link the device code in these HIP programs, generate device ISA, and create a host object containing the ISA, and add the host object to the generated static library. This is exactly what clang does for -fgpu-rdc when generating a shared library. The only difference is that, instead of generating a shared library, now clang generates a static library.

Can you give me a small example of using clang to generate such .so ? If this approach relies on compiling multiple *source* files at once, that will have practicall use issues. If it can be done on .o as inputs, then it may be OK.

Yes, this can be done on .o files. If you have multiple hip objects, foo.o and bar.o with their own device code, we can pass them into clang to generate a shared library file. clang currently extracts their bitcode, and links them together, and creates host objects containing the device code, and will link them together. It can accept multiple .o as inputs. With this patch, using the flag hip-emit-static-lib, we can generate a .a static library file.

When clang detects the .o files are clang-offload-bundle or host object, it will extract the bitcode and link them and generate ISA, then create a host object containing it, then link them together. We can link multiple objects together to create either a shared library or a static library (with this patch).

The semantics of this patch are a bit unclear to me.

When linking a static archive, the linker searches for each currently unresolved symbol in the archive and links in only the object that exposes that symbol. Unused objects are left in the archive. That's useful behaviour in various settings, and overridden with --whole-archive when all the objects are wanted.

The behaviour I would expect, from a user perspective, is to link in the host code which satisfies unresolved symbols for the host, and the device code that satisfies unresolved symbols for the device, and no extra files, regardless of whether the files are machine code for either arch, bitcode for either arch, or an offload-bundle potentially containing some of each. That's what we have today for mixtures of bitcode and machine code for a single architecture. And extensible in the obvious fashion for > 2 architectures.

That seems implementable - skip over objects for architectures other than the one currently of interest, and reach into clang-offload-bundle if present to do the same thing. Optionally provide a symbol table per arch to make it faster.

I think you're suggesting that the bitcode from every object is extracted and linked together, not just the ones that are actually used. If so, do you also change the semantics for the host code part of the object? What about for host objects that don't contain any bitcode? What about bitcode files for the device that don't have a corresponding host part?

clang/lib/Driver/ToolChain.cpp
579

I'm not really following you here. Longer comment at the top level.

It seems a shame to spawn llvm-ar to make the archive. It's not a very complicated file format and we already have the implementation of llvm-ar in tree. Can we use it as a library instead of forking?

When clang detects the .o files are clang-offload-bundle or host object, it will extract the bitcode and link them and generate ISA, then create a host object containing it, then link them together. We can link multiple objects together to create either a shared library or a static library (with this patch).

The semantics of this patch are a bit unclear to me.

When linking a static archive, the linker searches for each currently unresolved symbol in the archive and links in only the object that exposes that symbol. Unused objects are left in the archive. That's useful behaviour in various settings, and overridden with --whole-archive when all the objects are wanted.

The behaviour I would expect, from a user perspective, is to link in the host code which satisfies unresolved symbols for the host, and the device code that satisfies unresolved symbols for the device, and no extra files, regardless of whether the files are machine code for either arch, bitcode for either arch, or an offload-bundle potentially containing some of each. That's what we have today for mixtures of bitcode and machine code for a single architecture. And extensible in the obvious fashion for > 2 architectures.

That seems implementable - skip over objects for architectures other than the one currently of interest, and reach into clang-offload-bundle if present to do the same thing. Optionally provide a symbol table per arch to make it faster.

I think you're suggesting that the bitcode from every object is extracted and linked together, not just the ones that are actually used. If so, do you also change the semantics for the host code part of the object? What about for host objects that don't contain any bitcode? What about bitcode files for the device that don't have a corresponding host part?

The use case of this patch is that the library only needs to provide host APIs which can be called by C++ programs but do not need to expose device code to the users. Therefore the device code in the library are linked, finalized, and embedded as ISA in host objects and not exposed to library users. We have user requests to support such use case.

There may be use cases that users want to store both host codes and device codes in a static library and expose both to users, however that is out of scope of this patch.

clang/lib/Driver/ToolChain.cpp
579

Even if we can do that in-process, we still need the action and tool to represent that compilation step. Besides, llvm-ar does not support static library used by MSVC. If we want to extend this to MSVC toolchain, we need these API's.

The use case of this patch is that the library only needs to provide host APIs which can be called by C++ programs but do not need to expose device code to the users. Therefore the device code in the library are linked, finalized, and embedded as ISA in host objects and not exposed to library users. We have user requests to support such use case.

There may be use cases that users want to store both host codes and device codes in a static library and expose both to users, however that is out of scope of this patch.

Ah, OK. So it's HIP specific, and not a general purpose static archive. In that case, the many functions with names based on StaticLib should probably all be renamed HIPStaticLib or similar to indicate that it doesn't work for other toolchains. Can any of this code be moved into the HIP.cpp toolchain file?

The use case of this patch is that the library only needs to provide host APIs which can be called by C++ programs but do not need to expose device code to the users. Therefore the device code in the library are linked, finalized, and embedded as ISA in host objects and not exposed to library users. We have user requests to support such use case.

There may be use cases that users want to store both host codes and device codes in a static library and expose both to users, however that is out of scope of this patch.

Ah, OK. So it's HIP specific, and not a general purpose static archive. In that case, the many functions with names based on StaticLib should probably all be renamed HIPStaticLib or similar to indicate that it doesn't work for other toolchains. Can any of this code be moved into the HIP.cpp toolchain file?

It is not HIP specific. The use case is generic and common. Many cases users do not care about device code internals and only interested in using the functionality of a library through usual host APIs. If other toolchains would like to support the same use case they can easily extend it to their own use.

Also the implementation requires introduction of static lib tool in general since to get a tool it has to go through the generic interface of toolchain. There have already been all kinds of Tools and JobActions in the generic interface that are for specific toolchains, e.g. clang-offload-bundler, so this is not something new.

It is not HIP specific. The use case is generic and common. Many cases users do not care about device code internals and only interested in using the functionality of a library through usual host APIs. If other toolchains would like to support the same use case they can easily extend it to their own use.

Regardless of how popular the use case might be, it's not building a static library. It uses the same file format, which seems adequate for the purpose, but with different semantics.

Can you split this patch into a function for building static libraries, which other toolchains could use unmodified, and some HIP specific code which makes use of that function?

Also the implementation requires introduction of static lib tool in general since to get a tool it has to go through the generic interface of toolchain. There have already been all kinds of Tools and JobActions in the generic interface that are for specific toolchains, e.g. clang-offload-bundler, so this is not something new.

No objection to another tool. The objection is to calling it 'static lib', on the basis that it isn't one, and other toolchains may want to build a conventional static library from the clang driver.

Can you split this patch into a function for building static libraries, which other toolchains could use unmodified, and some HIP specific code which makes use of that function?

The current patch already does that. For other toolchains, this option will generate a normal static library, like ar does.

For HIP, it does something extra (link, finalize and embeds device code), then generate a static library for host object like ar does.

Maybe Aaron can add a lit test for using this option for C language.

This patch follows a generic design. It introduces generic option, JobAction, and Tool about creating static library which can be customized for each toolchain.

For linux toolchain, it is just the conventional ar, which is implemented in this patch.

For HIP toolchain, currently it implements a specific behavior to finalize device code and generate a common host static library. It is a static library from the host point of view.

This does not preventing it to be extended to generating other format of static libraries. We can easily introduce some options to generate static library in a different way, e.g. archive of clang-offload-bundler bundles, or clang-offload-bundler bundles of archives.

For other toolchains, they just need to override virtual functions getStaticLibTool and implement their own static lib tool to implement their own command to create static libs.

Can you split this patch into a function for building static libraries, which other toolchains could use unmodified, and some HIP specific code which makes use of that function?

The current patch already does that. For other toolchains, this option will generate a normal static library, like ar does.

Some comments inline in StaticLibTool::ConstructJob

Unrelated to the above, but a consequence of reading the code a few times, there's a lot of duplication with tools::AddHIPLinkerScript that doesn't seem necessary.

clang/lib/Driver/ToolChains/CommonArgs.cpp
1433

This code looked familiar - it seems to have a lot in common with AddHIPLinkerScript just above. Can the common parts be factored out?

1492

There isn't a linker script involved here. Copied & pasted from above?

clang/lib/Driver/ToolChains/Gnu.cpp
350

This function is named (and starts off) generic

373

Then does some things which aren't obviously generic. What's special about -stdlib here?

392

Finally does something that is definitely hip specific

ashi1 marked 7 inline comments as done.May 12 2020, 1:55 PM

Maybe Aaron can add a lit test for using this option for C language.

I can add a lit test for C language.

clang/lib/Driver/ToolChains/CommonArgs.cpp
1433

Could re-factoring be in a separate commit? It may be significant and risky. I think there can be other optimizations made here.

1492

Okay I will fix these comments, the comment can be file generator script.

clang/lib/Driver/ToolChains/Gnu.cpp
350

These two lines were added for the D.Diag error checking below.

373

This is a generic line from the Linker::ConstructJob function

392

There is a check in this function which will skip this function if not HIP host toolchain above.

// If this is not a HIP host toolchain, we don't need to do anything.
if (!JA.isHostOffloading(Action::OFK_HIP))
  return;
ashi1 updated this revision to Diff 263769.May 13 2020, 10:09 AM
ashi1 edited the summary of this revision. (Show Details)

Few cosmetic nits. LGTM in general. I'll leave the approval to @JonChesterfield

clang/lib/Driver/ToolChains/CommonArgs.cpp
1423–1426

You don't need to static_cast in order to do getTriple(), regular ToolChain should do the job.
With the cast removed, folding C.getSingleOffloadToolChain<Action::OFK_HIP>()->getTriple().getArch() == llvm::Triple::amdgcn into assert(). would also let you remove the (void)HIPTC below.

1472

Nit: Would it make sense to this if with the one above, so the file name construction is in one place for both names?

1479

Nit: It's only used once. Fold it into MakeArgString()?

clang/lib/Driver/ToolChains/Gnu.cpp
348–349

Nit: the cast may be unnecessary. Both getDriver() and AddLinkerInput() operate on base ToolChain.

379

This is odd.
We check is the file is writable or not a regular file. So far so good.
But then we issue a message that we can't *remove* the file, even though we didn't try to remove anything yet.
The checks we've done do not necessarily mean that the file is not removable. E.g. a read-only file in a writable directory is removable. On UNIX file unlinking is an operation on a directory, not on the file. There are also files that are not regular files that are also possible to delete. E.g. there are symlinks, device nodes, sockets.

IMO, if we report an error it should be due to an error. In this case we produce the diagnostic for something we didn't even try.

Perhaps just remove these predictive checks altogether and rely on the error returned by fs::remove() below?

JonChesterfield resigned from this revision.May 21 2020, 12:25 PM

I don't like the copy and paste but at least it's contained within HIP specific functions.

The control flow in tools::gnutools::StaticLibTool::ConstructJob doesn't seem good though. It's a generic function that unconditionally calls a hip specific function which happens to return immediately in non-hip cases. That really should be a hip specific function calling the generic one, then doing more hip specific things afterwards.

However, @tra knows the clang code base and conventions rather better than I do, so I'm going to defer back.

tra added a subscriber: echristo.May 21 2020, 1:59 PM

The control flow in tools::gnutools::StaticLibTool::ConstructJob doesn't seem good though. It's a generic function that unconditionally calls a hip specific function which happens to return immediately in non-hip cases. That really should be a hip specific function calling the generic one, then doing more hip specific things afterwards.

+1. Indeed it would be great to reuse existing tool either by making it more flexible, or via extracting common things into a helper.

However, @tra knows the clang code base and conventions rather better than I do, so I'm going to defer back.

I may be too used to the odd things we have to do to deal with GPUs. I've pinged @echristo for the second opinion.

ashi1 updated this revision to Diff 266302.May 26 2020, 12:49 PM
ashi1 marked 5 inline comments as done.

Thank you for your reviews, I've cleaned up the patch, and reduced the size of the StaticLib::ConstructJob.

ashi1 added a comment.May 26 2020, 1:00 PM
In D78759#2049973, @tra wrote:

The control flow in tools::gnutools::StaticLibTool::ConstructJob doesn't seem good though. It's a generic function that unconditionally calls a hip specific function which happens to return immediately in non-hip cases. That really should be a hip specific function calling the generic one, then doing more hip specific things afterwards.

+1. Indeed it would be great to reuse existing tool either by making it more flexible, or via extracting common things into a helper.

There might require a lot of changes to make this more flexible, and making a hip specific function call generic functions. I think it may be out of the scope for this patch.

I think I've got a lot of the same concerns that Jon had. In addition I'd like to see some reasoning and abstraction if we're going to use this for creating archives in general. No need to make windows work, but instead could just return a "we can't do that yet" error instead?

clang/lib/Driver/Driver.cpp
3502–3506

I'm really not a fan of this here. I think this needs to be abstracted a bit.

clang/lib/Driver/ToolChains/CommonArgs.cpp
1398

This needs comments.

1422

I'd really like to avoid a target specific assert here.

1433

You can refactor ahead of time to make this a bit more clear?

ashi1 marked an inline comment as done.May 27 2020, 11:15 AM

I think I've got a lot of the same concerns that Jon had. In addition I'd like to see some reasoning and abstraction if we're going to use this for creating archives in general. No need to make windows work, but instead could just return a "we can't do that yet" error instead?

Hi Eric, this patch is to support users who want to create a static library on the HIP language and ROCm architecture. I will add some comments to the code to make it more clear.

clang/lib/Driver/Driver.cpp
3502–3506

Could you please provide some advice on how to make this more abstract? I am not familiar with this area of the code.

ashi1 updated this revision to Diff 266945.May 28 2020, 10:44 AM

I've refactored the code, and removed the AddHIPLinkerScript function, which can be replaced by these MC directives approach.

ashi1 marked 10 inline comments as done.
ashi1 edited the summary of this revision. (Show Details)
ashi1 added a comment.Jun 1 2020, 12:27 PM

ping - is this refactored version ready for submission? Thanks.

yaxunl added inline comments.Jun 2 2020, 8:21 AM
clang/lib/Driver/Driver.cpp
3502–3506

You may introduce a member function Driver::shouldEmitStaticLibrary() and probably change -hip-emit-static-lib to -emit-static-lib

ashi1 updated this revision to Diff 267916.Jun 2 2020, 10:05 AM
ashi1 retitled this revision from HIP - Add Statically Linked Libraries to Add Statically Linked Libraries.
ashi1 edited the summary of this revision. (Show Details)

Thank you for the tips, I've updated the code with Driver::ShouldEmitStaticLibrary function, and renamed to flag to a more general name, --emit-static-lib.

yaxunl added a comment.Jun 5 2020, 4:44 AM

@echristo Aaron has addressed most of the comments. The major remaining issue is that the general linker
calls AddGenerateObjFileFromHIPFatBinary, which is supposed to be part of HIP
toolchain. This issue is not new, since the existing gnutools::Linker::ConstructJob
already did the same thing. Is it OK to land this patch and leave the refactoring of
AddGenerateObjFileFromHIPFatBinary to a separate patch? Thanks.

ashi1 marked 2 inline comments as done.Jun 16 2020, 12:39 PM

The use case of this patch is that the library only needs to provide host APIs which can be called by C++ programs but do not need to expose device code to the users. Therefore the device code in the library are linked, finalized, and embedded as ISA in host objects and not exposed to library users. We have user requests to support such use case.

There may be use cases that users want to store both host codes and device codes in a static library and expose both to users, however that is out of scope of this patch.

Ah, OK. So it's HIP specific, and not a general purpose static archive. In that case, the many functions with names based on StaticLib should probably all be renamed HIPStaticLib or similar to indicate that it doesn't work for other toolchains. Can any of this code be moved into the HIP.cpp toolchain file?

Hi @JonChesterfield and @echristo , I am opening a follow up commit that will move the linking logic into the HIP Toolchain rather than being in GNU.cpp. I will rebase this on top of master as well.

ashi1 updated this revision to Diff 271173.Jun 16 2020, 12:50 PM

Rebased on master, and added --emit-static-lib to hip-link-save-temps.hip testcase.

ashi1 updated this revision to Diff 271220.Jun 16 2020, 3:05 PM

Removed the HIP Linker changes, that will come in a different patch. Keeping this change more general.

This revision was automatically updated to reflect the committed changes.

This appears to have been committed without addressing all the comments or waiting for an acceptance from someone outside of our organisation. That doesn't seem right - am I missing part of the thread here?

ashi1 added a comment.Jun 29 2020, 7:53 AM

Hi Jon, I may have committed this in combination with D81963 as the parent commit. It should have waited for a LGTM from someone outside our org.

Please let me know unaddressed comments on this change, and I will fix them asap. Sorry about the confusion.