Page MenuHomePhabricator

Initial check-in of Acxxel (StreamExecutor renamed)
ClosedPublic

Authored by jhen on Oct 17 2016, 2:38 PM.

Details

Summary

Acxxel is basically a simplified redesign of StreamExecutor.

Here are the major points where Acxxel differs from the current StreamExecutor
design:

  • Acxxel doesn't support the kernel and kernel loader types designed for emission by the compiler to support type-safe kernel launches. For CUDA, kernels in Acxxel can be seamlessly launched using the standard CUDA triple-chevron kernel launch syntax that is available with clang and nvcc. For CUDA and OpenCL, kernel arguments can be passed in the old-fashioned way, as one array of pointers to arguments and another array of argument sizes. Although OpenCL doesn't get a type-safe kernel launch method, it does still get the benefit of all the memory management wrappers. In the future, clang may add support for triple-chevron OpenCL kernel launchs, or some other type-safe OpenCL kernel launch method.
  • Acxxel does not depend on any other code in LLVM, so it builds completely independently from LLVM.

The goal will be to check in Acxxel and remove StreamExecutor, or perhaps to remove the old StreamExecutor and rename Acxxel to StreamExecutor,
so I think Acxxel should be thought of as a new version of StreamExecutor, not
as a separate project.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
jlebar added inline comments.Oct 18 2016, 6:29 PM
acxxel/cuda_acxxel.cpp
29 ↗(On Diff #75084)

Does this have to be per-thread rather than per stream? Kind of a bummer that one thread can't operate on two streams for different devices (among other things).

At least maybe setActiveDevice should be called setActiveDeviceForThread.

169 ↗(On Diff #75084)

It looks like this map always has the keys 0, 1, ..., n. So, could we make it a vector? That will make lookups substantially cheaper, although I don't know if we care.

193 ↗(On Diff #75084)

How does this work, exactly? We set this thread's active device index to 0, but new threads have it as uninitialized memory? That seems not so good.

283 ↗(On Diff #75084)

Forgot to convert to seconds.

476 ↗(On Diff #75084)

std::move(Callback)

497 ↗(On Diff #75084)

Could we use std::array so we don't have to do these ARRAYSIZE shenanigans?

jhen updated this revision to Diff 75163.Oct 19 2016, 9:56 AM
jhen marked 4 inline comments as done.
  • Respond to jlebar's comments on cuda_acxxel.cpp
jhen added inline comments.Oct 19 2016, 10:00 AM
acxxel/cuda_acxxel.cpp
29 ↗(On Diff #75084)

TL;DR the CUDA driver library forces us to do it per thread or to suffer a performance penalty.

I chose it to be per thread for efficiency reasons. The old StreamExecutor tried to do it per stream, and this led to extra library calls for every operation to check that the current device was equal to the one expected by the Stream. Ideally we would just store a "context" in the stream, the context would know which device it belonged to, and the context would be used to launch all work, but unfortunately that is not how the CUDA driver library works. Instead, the CUDA driver library works a lot like this where it allows you to set the active context, but then you don't pass the context to each method call, instead, it just uses the active context.

It actually doesn't prevent a thread from using streams from different devices. It just forces that thread to do the "active device" bookkeeping on its own. Want to use stream0? First set the active device to device0. Want to go back to stream1? Remember to set the active device back to device1.

I did change the name as you suggested, though.

193 ↗(On Diff #75084)

Good point. I want every thread to start with active device set to zero because that's the way the CUDA runtime does it. I moved this initialization to the declaration of ActiveDeviceIndex. Based on some experiments and my interpretation of the standard, that should do the right thing.

jhen updated this revision to Diff 75167.EditedOct 19 2016, 10:19 AM
  • Fix deleted Span container constructor

I had forgotten to fix a detail of the Span class implementation before posting this patch. This patch cleans up that detail.

This all looks really good to me.

acxxel/cuda_acxxel.cpp
29 ↗(On Diff #75084)

I chose it to be per thread for efficiency reasons.

OK, I am totally onboard with that -- zero-cost abstractions.

But maybe we can have assert()s that ensure that users are always switching to a stream's device before touching that stream? If we have to have an ifndef NDEBUG variable or two inside the stream class to support this, that's fine by me.

We also need to have comments explaining this weird requirement.

acxxel/examples/opencl_example.cpp
56 ↗(On Diff #75167)

Nit, we can get away with single braces here, which I think is more natural.

59 ↗(On Diff #75167)

If we're using assert(), we should #error if we're built with NDEBUG, or alternatively #undef NDEBUG as the first thing in this file (before any #includes).

acxxel/opencl_acxxel.cpp
24 ↗(On Diff #75167)

I wonder if we need this extra namespace (same for cuda). We don't actually expose anything other than the make-platform function in it. Do we plan on exposing anything else?

39 ↗(On Diff #75167)

Do you want this to be explicit?

45 ↗(On Diff #75167)

Same for CUDA: Using TLS like this means that it doesn't ever make sense for us to have two platform objects going. Which is probably correct for other reasons, too.

So maybe instead of having createFoo functions, we should have getFoo functions that return a singleton (created lazily, using static Foo* = new Foo).

I know this is basically what the current string-to-platform function is trying to do, but I guess the suggestion is to push the singleton-ness down into the lower-level factory functions, since singletonness is necessary for correctness.

73 ↗(On Diff #75167)

This could live inside the anon ns? That's a potential speedup, too, since it lets the compiler make assumptions about this class that it can't otherwise.

Same for CUDA, if applicable (I didn't check).

176 ↗(On Diff #75167)

s/100/MaxNumEntries/?

214 ↗(On Diff #75167)

It's a nit, but I'm not sure we need this first if statement?

535 ↗(On Diff #75167)

Did we decide to make streams single-threaded to avoid this overhead?

If so, in addition to getting rid of the mutex, we should get rid of the comment about concurrently enqueueing an event, since it's not valid to touch the stream concurrently at all.

acxxel/span.h
10 ↗(On Diff #75167)

I really don't like these top-level comments that say nothing you couldn't have guessed from the filename, even though we have them all over LLVM.

Does having this really help doxygen? If so, would it make doxygen look awful if we moved the comment currently on the span class into this block, so that it actually provided some value?

26 ↗(On Diff #75167)

Can we make this constexpr?

59 ↗(On Diff #75167)

I am really surprised that this is what they suggest in the proposal. It's fine for now, if it becomes a problem we can turn it into an assert().

94 ↗(On Diff #75167)

Nit, could we just use data() in the decltype? We're going to use it in the constructor below anyway. Or is this as specified and the spec is just weird?

195 ↗(On Diff #75167)

I cannot figure out the reason that some of these functions are noexcept but not others.

acxxel/status.h
22 ↗(On Diff #75167)
34 ↗(On Diff #75167)

s/an/a/, and below.

57 ↗(On Diff #75167)

Can we not =default these four?

82 ↗(On Diff #75167)

Since we're going to take ownership of the status, we should take it by value, rather than const ref (and then std::move it into TheError).

This will make it work efficiently if we're passed an rvalue.

94 ↗(On Diff #75167)

We can do the same thing with Value -- just take it by value and std::move it.

154 ↗(On Diff #75167)

dblaikie taught me today that we could also take Expected by value in our operator=. This would let us have
just one overload instead of two, kind of nice.

acxxel/tests/acxxel_test.cpp
359 ↗(On Diff #75167)

Can we use proper synchronization operations, like a condvar? volatile doesn't necessarily do what we want on all platforms -- just sort of happens to work on x86.

362 ↗(On Diff #75167)

Wait...we have EXPECT*? Can we use this in the tests where we're currently using assert()s?

acxxel/tests/span_test.cpp
24 ↗(On Diff #75167)

Nit, maybe we should call this arraySize, since it only works on arrays.

37 ↗(On Diff #75167)

Nit, not sure why we have a variable for this. If you just want to make it clear what the second arg to the Span constructor means, you could write

Span(nullptr, /* size = */ 0);

I believe we even have a clang-tidy check to check comments in this style.

79 ↗(On Diff #75167)

Nit, should be consistent wrt we have an explicit length in the array or not. I kind of prefer not. (This appears in other files as well.)

86 ↗(On Diff #75167)

Nit, prefer constructing std::arrays with = {1, 2, 3}, as it's more natural. Also, double-braces have this truly awful footgun, which has permanently scarred those of us who have rediscovered it ourselves: https://youtu.be/rNNnPrMHsAA?t=11m15s

jhen updated this revision to Diff 75659.Oct 24 2016, 4:47 PM
jhen marked 21 inline comments as done.
  • Remove asserts in OpenCL example
  • Respond to jlebar's OpenCL, util comments
jhen added a comment.Oct 24 2016, 4:47 PM

In addition to responding to jlebar's posted comments, I also removed the acxxel::getPlatform function and replaced it with the two functions acxxel::getCUDAPlatform and acxxel::getOpenCLPlatform. I also added a comment to explain that the CUDA and OpenCL platforms are available out of the box with Acxxel, but that other platforms can be created as well. The old acxxel::getPlatform function made it confusing to think about how to add a new platform because it seemed like the new platform should also be registered somehow to put in on equal footing with CUDA and OpenCL. I hope this new design will be clearer in this aspect.

acxxel/cuda_acxxel.cpp
29 ↗(On Diff #75084)

I don't think that assert() calls are useful to check this because the platform libraries themselves (CUDA, OpenCL, etc.) will emit errors for these cases.

I added comments to Stream to explain its associated device index and caution about using it when the active device index doesn't match, and I added a function to query the device index for a Stream.

acxxel/examples/opencl_example.cpp
56 ↗(On Diff #75167)

I blame -Wall -Wextra for this. I changed it to -Wall -Wextra -Wno-missing-braces to get rid of the warning for this case.

59 ↗(On Diff #75167)

I changed assert to "check and exit" to prevent having to deal with this.

acxxel/opencl_acxxel.cpp
24 ↗(On Diff #75167)

I mostly got rid of these namespaces, but, as part of another design change, I moved the platform creation (now getter) functions into them.

39 ↗(On Diff #75167)

Yes, I just forgot to mark it. Thanks for catching that.

535 ↗(On Diff #75167)

Thanks for catching this. I got rid of the mutex and got rid of the whole CLEvent struct, since it is now just a cl_event*.

I also got rid of some comments for addCallback that talked about accessing the same stream from multiple threads, but I didn't see a comment that talked about Event objects being enqueued on streams concurrently. Please let me know if I didn't get rid of a comment I should have.

acxxel/span.h
10 ↗(On Diff #75167)

I don't like them either. They don't help with Doxygen as far as I know. I just put them in because I thought they were required for LLVM style. I've removed most of them now, but I did leave a few that I thought were actually serving a purpose.

59 ↗(On Diff #75167)

Sounds good. Like you say, it should be easy to change.

94 ↗(On Diff #75167)

I did this to try to make sure that the container supported operator[] (the proposed spec suggested I should check for this). Do you think there is a cleaner way to do that?

195 ↗(On Diff #75167)

I don't understand it either. In the class comment I added a link the spec document I used. That's where I got all these signatures.

acxxel/status.h
22 ↗(On Diff #75167)

Unfortunately gcc only seems to support this as a function attribute (and that seems consistent with the documentation you cited). When I tried to use it as a class attribute, I got a compilation error.

34 ↗(On Diff #75167)

Oops. Status used to be called Error.

57 ↗(On Diff #75167)

I learned something sad while trying to do this. Status has a std::string field, and it seems that the move assignment operator for std::string is not marked noexcept in my standard library. This means that the default move assignment operator for Status is not noexcept and I get a compile error if I try to declare it noexcept and default.

To deal with this, I chose to retain the noexcept and to re-implement the default move assignment operator. I was able to default the other 3 with my standard library, so I did that.

94 ↗(On Diff #75167)

I think I understood you correctly here. I collapsed the const T &Value and T &&Value constructors into one constructor taking the argument by value and moving it into TheValue. Let me know if I was only supposed to replace one of these constructors that way.

acxxel/tests/acxxel_test.cpp
362 ↗(On Diff #75167)

I couldn't find the places I am using assert() in tests. I think you might be referring to the stuff in the examples directory. That stuff doesn't depend on gtest, so it uses assert() instead.

acxxel/tests/span_test.cpp
37 ↗(On Diff #75167)

This is a fun one. If I pass in a literal 0 it tells me the call is ambiguous because Span<int> also has a constructor with the signature Span<int>(int *FirstElem, int *LastElem). It seems that C++ is helpfully willing to convert a literal 0 to an int * (I'm sure it has to do with NULL and C compatibility).

I still thought it was a useful test, though, because the size argument could be stored in a variable as I have done in my test.

79 ↗(On Diff #75167)

I think I got all the places where I was doing this.

I am happy with this except for my concern about the status handling on sync.

Would it be worth updating (some of?) the examples to the fluent interface?

acxxel/acxxel.cpp
25 ↗(On Diff #75659)

These namespaces seem redundant with the function names? What are you trying to achieve by keeping these out of the root acxxel namespace? (I suspect you have a good reason...)

acxxel/acxxel.h
227 ↗(On Diff #75659)

"platform's active device index for the thread" is awkward. This was hard for me to rephrase, but I think maybe it's better if we move the notion that it's platform-specific to the front of the paragraph, so we don't have to repeat it. Something like

Each Platform has a notion of the currently active device on a particular thread (see Platform::getActiveDeviceForThread and Platform::setActiveDeviceForThread).  Each Stream is associated with a specific, fixed device, set to the current thread's active device when we create the Stream.  Whenever a thread enqueues commands onto a Stream, its active device must match the Stream's device.
246 ↗(On Diff #75659)

Upon reconsideration, I am not entirely sold on how we're handling errors.

If you use the fluent API, you might reasonably write code like:

Status s = stream.do_something().copy_async().sync();
if (!s.ok()) return s;

But this is subtly wrong. It will usually do what you want (return errors that occurred in any of the three operations), but if we happen to get context-switched in between the first two operations, an error from the first one may be reported when we launch the second one. In which case the sync may return OK.

(You can write effectively the same code with the non-fluent API as well.)

This problem is exacerbated by the fact that Status is now a warn-if-unused type, so if someone writes the above code and *doesn't* look at the return value, we'll warn them to!

We could fix this by making sync() return void. Alternatively, we could make sync() implicitly do a takeStatus() (we'd want to comment this). I kind of like the latter, because, presuming that you compile with clang and synchronize your streams, this makes it hard to forget error checking. We could add the warn_unused attribute to this function and then get warnings with gcc, too. (You could do that in a separate patch if you wanted.)

What do you think?

acxxel/cuda_acxxel.cpp
536 ↗(On Diff #75659)

Storing it as a value type here will cause the destructor to run on program shutdown. :)

Now that I see this I like even more that we pushed the encapsulation down to these functions, because it gives you the flexibility to run code on program shutdown, or not, depending on what your platform needs.

acxxel/examples/opencl_example.cpp
59 ↗(On Diff #75167)

Oh, I didn't realize that this wasn't a test, duh. Sorry about that.

As you have it is fine, or asserts are also fine, since it's just an example.

acxxel/opencl_acxxel.cpp
535 ↗(On Diff #75167)

You got the comments I was after; we're good.

acxxel/span.h
94 ↗(On Diff #75167)

If the spec says to check for operator[], I'm down with this as-is.

acxxel/status.cpp
17 ↗(On Diff #75659)

As far as I can tell from cppreference, it's just not noexcept period.

Any reason you don't want these in the header file?

acxxel/status.h
104 ↗(On Diff #75659)

Unnecessary now that we're taking That by value.

22 ↗(On Diff #75167)

Aha, I see. That's cool, I didn't realize that clang let you warn_unused_result on a class.

Maybe this macro should have a slightly different name, to address my point of confusion. Then maybe we can have a comment to the effect of your response above?

57 ↗(On Diff #75167)

OK, I think that's safe because std::string's move constructor doesn't throw, as I read http://en.cppreference.com/w/cpp/string/basic_string/basic_string.

94 ↗(On Diff #75167)

Yep. It's slightly less efficient because we move value twice instead of just once, but I seriously doubt it's going to matter.

acxxel/tests/acxxel_test.cpp
352 ↗(On Diff #75659)

Because we're now reading and writing it from inside a mutex, we don't need GoFlag to be an atomic. (I mean, I suppose it doesn't hurt...)

MarkerFlag is read outside a mutex, so it still needs to be an atomic, or otherwise we could just guard everything with the one mutex -- that seems like the simplest approach to me.

acxxel/tests/span_test.cpp
37 ↗(On Diff #75167)

Hahaha, okay then. :)

jhen updated this revision to Diff 75737.Oct 25 2016, 11:04 AM
jhen marked 6 inline comments as done.
  • Respond to jlebar's comments 2016-10-24
jhen added a comment.Oct 25 2016, 11:04 AM

Would it be worth updating (some of?) the examples to the fluent interface?

Good idea, I updated both of them.

acxxel/acxxel.cpp
25 ↗(On Diff #75659)

Maybe not a good reason, but the reason I did it was because I also defined functions with the same name below. The functions below are wrappers for these functions, and the wrappers are responsible for returning an error if the library was compiled without support for the given platform.

To try to remove the redundancy, I changed both of these function names to just getPlatform and left the namespaces in place to distinguish which platform they are for. I'm happy to rename these functions to anything else, but I just want to preserve the public names getCUDAPlatform and getOpenCLPlatform for the wrappers.

acxxel/acxxel.h
227 ↗(On Diff #75659)

Thanks! As you can tell, I was having a tough time explaining things there.

246 ↗(On Diff #75659)

sync() is already doing an implicit takeStatus(), as you suggested. I didn't realize it wasn't mentioned in the comments. I've fixed the comments now.

I also added warn_if_unused the Stream functions returning a Status so that gcc can produce nice warnings too.

acxxel/cuda_acxxel.cpp
536 ↗(On Diff #75659)

Now storing it as a raw pointer instead. Same change for OpenCL.

acxxel/status.cpp
17 ↗(On Diff #75659)

Changed the comment simply to say that the move assignment operator for std::string is noexcept.

No good reason not to have these in the header. I moved them there.

acxxel/status.h
22 ↗(On Diff #75167)

I changed the name to ACXXEL_WARN_UNUSED_RESULT_TYPE and added a comment explaining why it only works with clang.

jlebar accepted this revision.Oct 25 2016, 11:42 AM
jlebar edited edge metadata.

\o/

acxxel/cuda_acxxel.cpp
546 ↗(On Diff #75737)

Can we just return MaybePlatform now, or does that not work because the types don't match? If so, maybe we should actually have a constructor on Expected which does this type conversion. (unique_ptr has something similar.) But we could do that in a separate patch.

This revision is now accepted and ready to land.Oct 25 2016, 11:42 AM
jhen updated this revision to Diff 75774.Oct 25 2016, 1:02 PM
jhen edited edge metadata.
  • Add ctors for Expected(Expected<U>)
jhen added inline comments.Oct 25 2016, 1:05 PM
acxxel/cuda_acxxel.cpp
546 ↗(On Diff #75737)

I added the templated copy constructor to make this work (and the corresponding templated move constructor while I was at it), so now we are just returning MaybePlatform.

One thing I wasn't sure about when adding the templated constructors was whether I should remove the corresponding non-templated constructors because they now seem redundant. I didn't remove them, though, because std::unique_ptr seems to keep both the templated and non-templated version.

jlebar added inline comments.Oct 25 2016, 1:14 PM
acxxel/cuda_acxxel.cpp
546 ↗(On Diff #75737)

Yes, I think it probably makes sense to keep the non-templated constructors, although I am not 100% sure if they're necessary, like you say.

This revision was automatically updated to reflect the committed changes.