This is an archive of the discontinued LLVM Phabricator instance.

[CUDA][OpenMP] Create generic offload action
ClosedPublic

Authored by sfantao on Mar 14 2016, 6:29 PM.

Details

Summary

This patch replaces the CUDA specific action by a generic offload action. The offload action may have multiple dependences classier in “host” and “device”. The way this generic offloading action is used is very similar to what is done today by the CUDA implementation: it is used to set a specific toolchain and architecture to its dependences during the generation of jobs.

This patch also proposes propagating the offloading information through the action graph so that that information can be easily retrieved at any time during the generation of commands. This allows e.g. the "clang tool” to evaluate whether CUDA should be supported for the device or host and ptas to easily retrieve the target architecture.

This is an example of how the action graphs would look like (compilation of a single CUDA file with two GPU architectures)

0: input, "cudatests.cu", cuda, (host-cuda)
1: preprocessor, {0}, cuda-cpp-output, (host-cuda)
2: compiler, {1}, ir, (host-cuda)
3: input, "cudatests.cu", cuda, (device-cuda, sm_35)
4: preprocessor, {3}, cuda-cpp-output, (device-cuda, sm_35)
5: compiler, {4}, ir, (device-cuda, sm_35)
6: backend, {5}, assembler, (device-cuda, sm_35)
7: assembler, {6}, object, (device-cuda, sm_35)
8: offload, "device-cuda (nvptx64-nvidia-cuda:sm_35)" {7}, object
9: offload, "device-cuda (nvptx64-nvidia-cuda:sm_35)" {6}, assembler
10: input, "cudatests.cu", cuda, (device-cuda, sm_37)
11: preprocessor, {10}, cuda-cpp-output, (device-cuda, sm_37)
12: compiler, {11}, ir, (device-cuda, sm_37)
13: backend, {12}, assembler, (device-cuda, sm_37)
14: assembler, {13}, object, (device-cuda, sm_37)
15: offload, "device-cuda (nvptx64-nvidia-cuda:sm_37)" {14}, object
16: offload, "device-cuda (nvptx64-nvidia-cuda:sm_37)" {13}, assembler
17: linker, {8, 9, 15, 16}, cuda-fatbin, (device-cuda)
18: offload, "host-cuda (powerpc64le-unknown-linux-gnu)" {2}, "device-cuda (nvptx64-nvidia-cuda)" {17}, ir
19: backend, {18}, assembler
20: assembler, {19}, object
21: input, "cuda", object
22: input, "cudart", object
23: linker, {20, 21, 22}, image

The changes in this patch pass the existent regression tests (keeps the existent functionality) and resulting binaries execute correctly in a Power8+K40 machine.

Diff Detail

Event Timeline

sfantao updated this revision to Diff 50689.Mar 14 2016, 6:29 PM
sfantao retitled this revision from to [CUDA][OpenMP] Create generic offload action.
sfantao updated this object.
sfantao added reviewers: ABataev, jlebar, tra, echristo, hfinkel.
mkuron added a subscriber: mkuron.Mar 19 2016, 1:42 AM
tra edited edge metadata.Mar 23 2016, 2:12 PM

Thank you for making these changes. They don't solve all the shortcomings, but they improve things quite a bit, IMO.

Overall I'm happy with the changes, though use of mutable and changing action state from const functions may need a look from someone with better C++-fu skills than myself.
@jlebar: Justin, any suggestions on what we can/should do regarding that?

tra added inline comments.Mar 23 2016, 2:12 PM
include/clang/Driver/Action.h
100

I assume that by combination you imply that this is a mask of all offloading kinds we intend to use.
Perhaps it should be renamed to reflect it better. ActiveOffloadKindMask?

100

These fields appears to be part of the Action state yet you want to modify them from const functions which seems wrong to me. Perhaps functions that set these fields should not be const.

144

propagateXXX implies modification which contradicts 'const' which implies that there will be no state change. Which one is telling the truth?

296–303

Are these independent or is it expected that all four are populated/modified in sync? If they all need to be updated at the same time (which is what the code below does), it should be documented. Perhaps these arrays could be converted into vector of structs.

308

Can any of parameters be nullptr?

322–328

Cosmetic nit: Naming is somewhat inconsistent. Device dependencies use acronyms, but HostDependence uses mix of acronyms and words. I'd use camelcase words in both.

include/clang/Driver/Driver.h
423

TC is only used to get triple to construct file name prefix. Perhaps just pass that string explicitly.

lib/Driver/Action.cpp
45

Given that these functions change state they probably should not be const.

104

There's no need for ToolChain here. A string is all it needs as an input.

228

returning nullptr if .size() != 1 looks strange. Perhaps there should be an assert.

lib/Driver/Driver.cpp
1444–1448

Perhaps this should be moved down closer to where it's used. Perhaps even inside of if(PartialCompilation ...)

1514

Is toolchain needed for fatbin action?

1958

You could fold both ifs into something like this:

if (auto *DDAP = dyn_cast_or_null<T>(OA->getSingleDeviceDependence()))
2129

It may be worth adding a comment explaining what happens if OffloadDeviceInputInfos.size() != 1.

lib/Driver/Tools.cpp
3834

All we need is a target triple here. Now that we have device offloading info, perhaps we can bypass AuxToolchain and let offloading info provide host or device triple directly. That would render FIXME above obsolete, IMO.

jlebar edited edge metadata.Mar 23 2016, 2:17 PM

@jlebar: Justin, any suggestions on what we can/should do regarding [const correctness issues]?

Using "mutable" to work around const-incorrectness should be a last resort. I've been frustrated by the const-incorrectness of Action as well, but if it's an issue here, I think it needs to be fixed, not hacked around.

sfantao updated this revision to Diff 52879.Apr 6 2016, 6:41 PM
sfantao marked 15 inline comments as done.
sfantao edited edge metadata.

Address Art, Justin and Eric comments.

Hi Art, Justin,

Thanks for the review and feedback! Tried to address your concerns. Let me know other suggestion you may have.

Thanks again,
Samuel

include/clang/Driver/Action.h
100

That's correct. This is meant to query a host action what are all the programing models used in its dependences.

I am now using ActiveOffloadKindMask as you suggested.

100

Yeah... I was not very happy with these mutable members either... So, after revisiting the issue, the best possible solution was to do all the propagation when the action are appended to the action list. For that I keep track of the offloading kinds employed for a given input and compilation and use that to propagate the information to top-level actions.

I will try to abstract that a little better in the offload handler I am proposing in http://reviews.llvm.org/D18172.

Let me know if you still find issues with the approach I am adopting now.

144

I refactored the propagation to work on top of non-const actions. So this is not const anymore.

296–303

Yes, they are meant to be populated in sync. I'll document that in the comment as you suggest.

The reason I am not using array of structs is that it is more convenient to have the action list separate to forward it to initialize the base class.

308

Only BoundArch can be null. I am changing the signature of add() to use reference types for the arguments that must not be null.

322–328

Ok, I am using camelcase in both now. I also noticed I was not using the typedef types in the private members of the dependencies. I fixed that as well.

include/clang/Driver/Driver.h
423

Ok, I am just passing the string now.

lib/Driver/Action.cpp
45

I fixed that.

104

Ok, passing the string with the normalized triple now.

228

This is used when actions get collapsed. In general we can have multiple device dependences, and if so we do not collapse. Therefore this member function was also working as a check.

I'm have changed this to have a 'get' and a 'has' version, and use the assertion in the 'get' version.

lib/Driver/Driver.cpp
1444–1448

I moved it right before the if(Partial...) statement, because it is also used after that.

1514

Yes, it is required to enquire which link tool should be used for the device action.

1958

Given that I am adding a new query to check the existence of the Host or single-device action, I am keeping the two ifs separate. getSingleDeviceDependence now does an assertion. Let me know if you prefer me to do this differently.

2129

I am elaborating on that in the comment now. I also got rid of doOnHostDependence here and I use OA->getHostDependence() that now contains the assertion.

lib/Driver/Tools.cpp
3834

I removed the use of AuxToolChain. It was being used also for the preprocessor argument. I added some new login in there so that the information is extracted from the toolchains owned by compilation.

Let me know if that is what you had in mind.

sfantao updated this revision to Diff 60573.Jun 13 2016, 11:58 AM
  • Rebase.

Any more comments on this one?

Thanks!
Samuel

sfantao updated this revision to Diff 62230.Jun 29 2016, 9:22 AM
  • Better organize how the offload action is used and add more comments to document what is going on.
  • Rebase.

@tra, any other comments/suggestions about this patch?

Thanks!
Samuel

ABataev edited edge metadata.Jun 29 2016, 9:41 PM

No '\brief's

include/clang/Driver/Action.h
207

'final'

286

'final'

324

'final'

330–332

default initializers

354

Default initializer

lib/Driver/Driver.cpp
1944–1966

Three slashes

sfantao updated this revision to Diff 62435.Jun 30 2016, 4:01 PM
sfantao marked 5 inline comments as done.
sfantao edited edge metadata.
  • Mark classes with final and fix comments.

Hi Alexey,

Thanks for the review! I addressed your comments in the last diff.

No '\brief's

When you say "no \brief" do you mean I am using that where I shouldn't or the other way around. I am only adding that for member functions/vars that I create, should I do that for member I am modifying too? I noticed two places where I was using that for static functions, I removed \brief in those cases. Let me know if that is what you wanted me to do.

Thanks again,
Samuel

Hi Alexey,

Thanks for the review! I addressed your comments in the last diff.

No '\brief's

When you say "no \brief" do you mean I am using that where I shouldn't or the other way around. I am only adding that for member functions/vars that I create, should I do that for member I am modifying too? I noticed two places where I was using that for static functions, I removed \brief in those cases. Let me know if that is what you wanted me to do.

Thanks again,
Samuel

Ok, I got you. In my comments the abstract can always be the first sentence, so \brief is not required. I'll remove that as you suggest.

Thanks,
Samuel

sfantao updated this revision to Diff 62460.Jun 30 2016, 8:38 PM
  • Remove \brief.
sfantao updated this revision to Diff 62572.Jul 1 2016, 4:51 PM
  • Rebase
sfantao updated this revision to Diff 63680.Jul 12 2016, 8:15 AM
  • Rebase.
  • Remove static function no longer necessary.

@tra, any more comments about this patch?

Thanks!
Samuel

tra added a comment.Jul 12 2016, 10:34 AM

Few minor nits and suggestions. Other than that I'm OK with the patch.

lib/Driver/Action.cpp
156

Minor style nit -- LLVM coding standard says :

... we strongly prefer loops to be written so that they evaluate it once before the loop starts.

for (unsigned i = 0, e = getInputs().size(); i != e; ++i)

Please check other for loops throughout the patch.

200–207

It could be rephrased as "do work if we have dependencies" and make code a bit more concise.

if (auto *A = DDeps.getActions()[i]) {
   getInputs().push_back(A);
   A-> propagate...
}
213

Please add assert to verify that getInputs() is not empty.
It may be worth doing throughout the patch as there are several places where we indexing into getInputs() result without verifying its size. It's not at all obvious from the code that it's always OK to do so.

219–230

You may want to add an assert that I and TI are both valid within the loop.

sfantao updated this revision to Diff 63850.Jul 13 2016, 1:30 PM
sfantao marked 4 inline comments as done.
  • Add assertions before accessing getInputs() when assumptions are made about the number of elements.
  • Add missing evaluations of end iterator in the beginning of loops.
  • Address other comments by Art.
sfantao updated this revision to Diff 63852.Jul 13 2016, 1:36 PM
  • Add one more assert.

Hi Art,

Thanks for the review! Addressed your comments in the last diff.

Thanks again,
Samuel

lib/Driver/Action.cpp
219–230

I added an assertion for TI. I didn't do that for I though, as it is the exit condition of the loop, so it will be always valid. Let me know if you still want me to add that.

tra added inline comments.Jul 13 2016, 1:54 PM
lib/Driver/Action.cpp
219–230

I don't see any changes in this function in your latest patch. Did you add that assert somewhere else?

I'm not worried about validity of I which is indeed ensured by the loop, but rather want to verify that number of inputs we process and number of elements in DevToolChains match. While running out of TI elements eary would be obviously wrong, I assume that exiting the loop with some remaining TI elements would also be unexpected and that we should assert() that it does not happen.

sfantao updated this revision to Diff 63865.Jul 13 2016, 3:02 PM
sfantao marked an inline comment as done.
  • Modify assertion to check that sizes of input dependences and device toolchains is consistent.
sfantao added inline comments.Jul 13 2016, 3:05 PM
lib/Driver/Action.cpp
219–230

The only change in this function was the assertion inside the loop. Its possible my message got in before the actual diff, sorry about that...

Ok, got it. I'm replacing the assertion in the loop body by an assertion that checks if the sizes of the inputs and toolchains are consistent. Let me know if you'd rather have a check of the iterator inside and after the loop.

tra added a comment.Jul 13 2016, 3:37 PM

The changes look good.

Now we just need some tests. Something along the lines of test/Driver/phases.c should do.

sfantao updated this revision to Diff 63880.Jul 13 2016, 4:26 PM
  • Add test to check the generated phases for CUDA.
  • Fix typo in comment.
tra added inline comments.Jul 13 2016, 4:50 PM
test/Driver/cuda_phases.cu
2

Few words describing the test would be nice to have.

You may also want to add few REQUIRES fields so the test does not break for builds w/o PPC or NVPTX.

// REQUIRES: clang-driver
// REQUIRES: nvptx-registered-target
// REQUIRES: powerpc-registered-target

I wonder if the test need host arch specified at all. I think it should be able to run on any host as long as it supports NVPTX.

sfantao updated this revision to Diff 63889.Jul 13 2016, 5:22 PM
  • Add comments and REQUIRE directives to test.
sfantao marked an inline comment as done.Jul 13 2016, 5:26 PM
sfantao added inline comments.
test/Driver/cuda_phases.cu
2

Oh, ok. Thought that the registration of the targets was checked after the phases generation. I added the REQUIRES and comments in the test.

I'm still using the explicit target architecture in the commands given that it is also being matched in the tests. Let me know if you prefer to use a wildcard in the tests instead.

tra accepted this revision.Jul 13 2016, 5:28 PM
tra edited edge metadata.

LGTM.

test/Driver/cuda_phases.cu
48

architecture*s*.
There are few more copy-pasted cases below.

This revision is now accepted and ready to land.Jul 13 2016, 5:28 PM
sfantao closed this revision.Jul 15 2016, 4:20 PM
sfantao marked an inline comment as done.