This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Implemented additional processing steps needed to link with CUDA libdevice bitcode.
Closed, Public

Authored by tra on Jul 30 2015, 2:58 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

tra updated this revision to Diff 31079. Jul 30 2015, 2:58 PM
tra retitled this revision to [CUDA] Implemented additional processing steps needed to link with CUDA libdevice bitcode.
tra updated this object.
tra added reviewers: echristo, jholewinski, eliben.
tra added a subscriber: cfe-commits.
echristo added inline comments. Aug 24 2015, 1:22 PM
lib/CodeGen/CodeGenAction.cpp
166–170 ↗(On Diff #31079)

Can you explain this in a different way perhaps? I'm not sure what you mean here.

181–190 ↗(On Diff #31079)

Seems like this should be part of the normal IPO pass run? This seems like an odd place to put this, can you explain why a bit more?

tra added inline comments. Aug 24 2015, 2:10 PM
lib/CodeGen/CodeGenAction.cpp
166–170 ↗(On Diff #31079)

This patch implements the following items from llvm.org/docs/NVPTXUsage.html:

The internalize pass is also recommended to remove unused math functions from the resulting PTX. For an input IR module module.bc, the following compilation flow is recommended:

1. Save list of external functions in module.bc
2. Link module.bc with libdevice.compute_XX.YY.bc
3. Internalize all functions not in list from (1)
4. Eliminate all unused internal functions

The LLVM part of the changes takes care of NVVMReflect:

  • Run NVVMReflect pass
  • Run standard optimization pipeline
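To make the four-step flow quoted from the NVPTX docs concrete, here is a minimal sketch on a toy model rather than real LLVM IR. Everything in it is an assumption for illustration: `ToyFunction`, `ToyModule`, and the helper names are hypothetical and are not LLVM or clang APIs. A module is modeled as a map from function name to its callees plus a linkage flag.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model, not an LLVM API.
struct ToyFunction {
  std::vector<std::string> callees;  // names this function calls
  bool external;                     // true = externally visible linkage
};
using ToyModule = std::map<std::string, ToyFunction>;

// Step 1: save the list of external functions in the user module.
std::set<std::string> saveExternals(const ToyModule &m) {
  std::set<std::string> names;
  for (const auto &entry : m)
    if (entry.second.external)
      names.insert(entry.first);
  return names;
}

// Step 2: link the library in; definitions already in dest are kept.
void linkModules(ToyModule &dest, const ToyModule &lib) {
  for (const auto &entry : lib)
    dest.insert(entry);
}

// Step 3: internalize every function that is not in the saved list.
void internalizeExcept(ToyModule &m, const std::set<std::string> &keep) {
  for (auto &entry : m)
    if (!keep.count(entry.first))
      entry.second.external = false;
}

// Step 4: erase internal functions unreachable from any external function.
void eliminateDead(ToyModule &m) {
  std::set<std::string> live;
  std::vector<std::string> worklist;
  for (const auto &entry : m)
    if (entry.second.external) {
      live.insert(entry.first);
      worklist.push_back(entry.first);
    }
  while (!worklist.empty()) {
    std::string cur = worklist.back();
    worklist.pop_back();
    for (const auto &callee : m.at(cur).callees)
      if (m.count(callee) && live.insert(callee).second)
        worklist.push_back(callee);
  }
  for (auto it = m.begin(); it != m.end();) {
    if (live.count(it->first))
      ++it;
    else
      it = m.erase(it);
  }
}

// Steps 1-4 in order, returning the pruned module.
ToyModule linkWithLibrary(ToyModule user, const ToyModule &lib) {
  const std::set<std::string> keep = saveExternals(user);
  linkModules(user, lib);
  internalizeExcept(user, keep);
  eliminateDead(user);
  return user;
}
```

With a user module whose only kernel calls one library function, `linkWithLibrary` keeps the kernel, that function, and whatever it calls transitively, and drops the rest of the library before any later optimization would see it.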

181–190 ↗(On Diff #31079)

It will indeed happen during normal optimization, but as the NVPTX docs say, it makes a fair amount of sense to eliminate quite a bit of bitcode that we know we're not going to need. libdevice carries ~450 functions and only a handful of those are needed. Why run all the other optimization passes on them?

In addition to that, we need to pass Internalize the list of symbols to preserve. As far as I can tell, the way to do that within the normal optimization pipeline is to pass them to the back-end via -internalize-public-api-list/-internalize-public-api-file. That's not a particularly suitable way to carry the potentially large list of symbols we will find in the TU we're dealing with.

I could move GDCE to LLVM, where it would arguably be somewhat more effective if done after NVVMReflect, but keeping it next to Internalize makes it easier to see that we intentionally internalize and eliminate unused bitcode here.

echristo added inline comments. Aug 24 2015, 2:46 PM
lib/CodeGen/CodeGenAction.cpp
181–190 ↗(On Diff #31079)

I might not have been clear. I'm curious why all of this isn't just part of the normal IPO pass run that should be happening on the code anyhow? Taking a step back - this should just go through the normal "let's set up a pipeline for the code", which might end up being cuda specific, but should be handled in the same way.

Does that make sense?

tra updated this revision to Diff 33713. Sep 1 2015, 11:07 AM
tra updated this object.

Updated the patch to use the bitcode linker to perform selective linking and internalizing.
Removed Internalize+GDCE passes.
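
As a rough illustration of what selective linking with internalizing means, the sketch below pulls in only the library functions that the destination module references, directly or transitively, and marks each one internal as it is copied. This is a toy model under stated assumptions, not the patch's code: `Func`, `FuncMap`, and `linkOnlyNeeded` are hypothetical names, and the real work is done by LLVM's bitcode linker.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical toy model, not an LLVM API.
struct Func {
  std::vector<std::string> callees;  // names this function calls
  bool internal;                     // true = not visible outside the module
};
using FuncMap = std::map<std::string, Func>;

// Copy from lib only what dest needs, internalizing each copied function.
void linkOnlyNeeded(FuncMap &dest, const FuncMap &lib) {
  // Seed the worklist with every name called from the destination module.
  std::vector<std::string> worklist;
  for (const auto &entry : dest)
    for (const auto &callee : entry.second.callees)
      worklist.push_back(callee);

  while (!worklist.empty()) {
    std::string name = worklist.back();
    worklist.pop_back();
    auto it = lib.find(name);
    // Skip names already defined in dest or not provided by the library.
    if (dest.count(name) || it == lib.end())
      continue;
    Func copy = it->second;
    copy.internal = true;  // linked-in symbols are internalized
    dest[name] = copy;
    // The copied function may need further library functions.
    for (const auto &callee : copy.callees)
      worklist.push_back(callee);
  }
}
```

Compared with linking everything and pruning afterwards, unused library functions never enter the module here, so no separate Internalize or GDCE step is needed to get rid of them.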

echristo accepted this revision. Sep 1 2015, 11:08 AM
echristo edited edge metadata.

The ternary is a bit ugly, but LGTM. :)

-eric

This revision is now accepted and ready to land. Sep 1 2015, 11:08 AM

Inline comment.

test/CodeGenCUDA/link-device-bitcode.cu
23–28 ↗(On Diff #33713)

It would be better here if we could just check the pass structure and rely on the backend to test that the pass works.

tra updated this revision to Diff 34251. Sep 8 2015, 1:53 PM
tra updated this object.
tra edited edge metadata.

Assume (and test) that the NVPTX back-end includes NVVMReflect by default.

Works for me, thanks.

tra updated this revision to Diff 34467. Sep 10 2015, 11:07 AM

Removed unneeded #includes.

This revision was automatically updated to reflect the committed changes.