Page MenuHomePhabricator

[CUDA] Invoke ptxas and fatbinary during compilation.
ClosedPublic

Authored by jlebar on Jan 11 2016, 1:20 PM.

Details

Summary

Previously we compiled CUDA device code to PTX assembly and embedded
that asm as text in our host binary. Now we compile to PTX assembly and
then invoke ptxas to assemble the PTX into a cubin file. We gather the
ptx and cubin files for each of our --cuda-gpu-archs and combine them
using fatbinary, and then embed that into the host binary.

Adds two new command-line flags, -Xcuda_ptxas and -Xcuda_fatbinary,
which pass args down to the external tools.

(The -S test removed in cuda-options.cu is added back with some modifications in D16081.)

Diff Detail

Repository
rL LLVM

Event Timeline

jlebar updated this revision to Diff 44543.Jan 11 2016, 1:20 PM
jlebar retitled this revision from to [CUDA] Invoke ptxas and fatbinary during compilation..
jlebar updated this object.
jlebar added reviewers: tra, echristo.
jlebar added subscribers: jhen, cfe-commits.
jlebar updated this object.Jan 11 2016, 1:37 PM
tra edited edge metadata.Jan 11 2016, 2:32 PM

Make sure it works with -save-temps and -fintegrated-as/-fno-integrated-as. They tend to throw wrenches into pipeline construction.

lib/Driver/Driver.cpp
1380 ↗(On Diff #44543)

So, you're treating GpuArchName==nullptr as a special case of DeviceAction for fatbin?
Perhaps it warrants its own CudaDeviceLinkAction?

1620–1625 ↗(On Diff #44543)

cubin *is* an ELF object file. Do we really need a new type here or can we get by with TY_Object?

lib/Driver/ToolChains.cpp
4193 ↗(On Diff #44543)

unneeded {} here and in few more places throughout the patch.

4230 ↗(On Diff #44543)

You may as well move it out of the loop and return early if BoundArch is nullptr.

lib/Driver/Tools.cpp
10621–10625 ↗(On Diff #44543)

CmdArgs.push_back(TC.getTriple().isArch64Bit() ? "-m64" : "-m32");

or, even

ArgStringList CmdArgs = {TC.getTriple().isArch64Bit() ? "-m64" : "-m32"};

Same in Linker::ConstructJob below.

10677–10678 ↗(On Diff #44543)

First line does not parse.

In general compute_XX does not necessarily match sm_XX.
ptxas options says that

Allowed values for this option: compute_20, compute_30, compute_35, compute_50, compute_52; and sm_20, sm_21, sm_30, sm_32, sm_35, sm_50 and sm_52

Note that there's no compute_21, compute_32. You'll need sm_XX -> compute_YY map.

lib/Driver/Tools.h
923 ↗(On Diff #44543)

Please add more details about what fatbin does.
".. which combines GPU object files and, optionally, PTX assembly into a single output file."

echristo edited edge metadata.Jan 11 2016, 2:39 PM

One question inline, one nit, and one more question here: You've got a couple of checks inline for null names/architectures, where do you expect those to come from and can you test for them? Or, another question, is if they're multiple architectures shouldn't we be able to see all of the actions that arise from each gpu?

Or I'm missing something, either way an explanation is good. :)

-eric

lib/Driver/Driver.cpp
1880 ↗(On Diff #44543)

Weren't you just adding it as part of the InputInfo constructor in 16078?

lib/Driver/ToolChains.cpp
4261 ↗(On Diff #44543)

Nit: No braces around single lines.

jlebar updated this revision to Diff 44584.Jan 11 2016, 4:55 PM
jlebar marked 8 inline comments as done.
jlebar edited edge metadata.

Address tra, echristo's review comments.

In D16082#324138, @tra wrote:

Make sure it works with -save-temps and -fintegrated-as/-fno-integrated-as. They tend to throw wrenches into pipeline construction.

Thanks. All of them worked except -fintegrated-as, which was causing us not to invoke ptxas. Fixed, and added a test.

lib/Driver/Driver.cpp
1380 ↗(On Diff #44543)

It's a special case either way, but there are enough places that look for a CudaDeviceAction that I *think* this special case is cleaner than having a bunch of if (CudaDeviceAction || CudaDeviceLinkAction)s floating around.

1620–1625 ↗(On Diff #44543)

Got rid of the extra type. Note that this means that our intermediate objs will be named foo.o instead of foo.cubin, which isn't optimal, but I agree that it's simpler without the extra type.

1880 ↗(On Diff #44543)

Yes, but we're doing something slightly different here. Suppose you have a CudaDeviceAction --> BackendAction. What do you want the InputInfo's Action to be? Without this change, it's the BackendAction. But we really want it to be the CudaDeviceAction. Thus this line.

I updated the comment in an attempt to clarify.

lib/Driver/ToolChains.cpp
4230 ↗(On Diff #44543)

Tried this and discovered it's actually subtly different in a way that, thankfully, affects one of the tests I added.

if (!BoundArch) continue applies only if A->getOption().matches(options::OPT_Xarch__). So we can't hoist it into an early return.

lib/Driver/Tools.cpp
10677–10678 ↗(On Diff #44543)

Oh goodness, how awful. Thank you for catching that. Fixed.

jlebar updated this revision to Diff 44586.Jan 11 2016, 5:11 PM

Add test checking that sm_XX gets translated to compute_YY correctly.

tra accepted this revision.Jan 12 2016, 3:54 PM
tra edited edge metadata.

LGTM.

This revision is now accepted and ready to land.Jan 12 2016, 3:54 PM
echristo accepted this revision.Jan 14 2016, 1:32 PM
echristo edited edge metadata.

This is terrible, but the only other option is fixing bind arch and inverting the graph which is a major rewrite to the driver.

So, LGTM.

-eric

This revision was automatically updated to reflect the committed changes.