This is an archive of the discontinued LLVM Phabricator instance.

Initial check-in of Acxxel (StreamExecutor renamed)
ClosedPublic

Authored by jhen on Oct 17 2016, 2:38 PM.

Details

Summary

Acxxel is basically a simplified redesign of StreamExecutor.

Here are the major points where Acxxel differs from the current StreamExecutor
design:

  • Acxxel doesn't support the kernel and kernel loader types designed for emission by the compiler to support type-safe kernel launches. For CUDA, kernels in Acxxel can be seamlessly launched using the standard CUDA triple-chevron kernel launch syntax that is available with clang and nvcc. For CUDA and OpenCL, kernel arguments can be passed in the old-fashioned way, as one array of pointers to arguments and another array of argument sizes. Although OpenCL doesn't get a type-safe kernel launch method, it does still get the benefit of all the memory management wrappers. In the future, clang may add support for triple-chevron OpenCL kernel launchs, or some other type-safe OpenCL kernel launch method.
  • Acxxel does not depend on any other code in LLVM, so it builds completely independently from LLVM.

The goal will be to check in Acxxel and remove StreamExecutor, or perhaps to remove the old StreamExecutor and rename Acxxel to StreamExecutor,
so I think Acxxel should be thought of as a new version of StreamExecutor, not
as a separate project.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Due to the shift in emphasis away from supporting type-safe kernel launches and the movement of streams from being the central programming entities,

Can you please discuss the motivation for this?

jhen added a comment.Oct 17 2016, 3:10 PM

Due to the shift in emphasis away from supporting type-safe kernel launches and the movement of streams from being the central programming entities,

Can you please discuss the motivation for this?

When hammering out the finer details of kernel loader objects from the original StreamExecutor implementation, it was brought to my attention that templated CUDA kernels would be a big problem for that model. Delving into the internal usage at Google, we also realized that basically nobody was using the StreamExecutor kernel launch style for CUDA, but folks were instead using the CUDA triple-chevron notation. Since triple-chevron supports templated kernels, is supported by clang for CUDA, and (I think) can be extended to support OpenCL, I think it is better to move forward with triple-chevron as the kernel launch model, and to make it work as seamlessly with StreamExecutor as possible.

I don't want this current choice to preclude adding in more general type-safe kernel launch support to this library in the future, but it doesn't seem like StreamExecutor has the right model for that at this time. With Acxxel, I think we will have a good place to experiment with what the right model will be, and maybe that means extending triple-chevron launches to more general accelerator platforms. Meanwhile, Acxxel is still extremely useful as a modern C++ wrapper for host-side accelerator management.

For those who view platform independence as the main feature of StreamExecutor, I think that not much has changed during this shift of focus in StreamExecutor. Acxxel also supports CUDA kernel launches with the same type-unsafe interface that it provides for OpenCL, so by using this interface the host code will only have to be written once to support both platforms.

One major benefit of the Acxxel model over StreamExecutor is that Acxxel code can be compiled by nvcc, whereas StreamExecutor required support from the compiler. This provides an easy route for users of nvcc to try out Acxxel and get platform independence and nvcc compatibility at the same time.

As for the change in the style of dealing with streams, I mentioned a bit in my comments that without fluent kernel launches, it loses most of its benefit. It is also awkward to convert OpenCL and CUDA runtime error handling to work well with this fluent stream interface.

jlebar edited edge metadata.Oct 17 2016, 5:21 PM

Got through the header file, now my eyes are starting to glaze over. Will come back to this tomorrow. This looks really good, though.

The one thing I really miss from the old SE is not checking the error at every line. I wonder if we could say that errors carry forward just like they used to? Or maybe they do actually carry forward and I don't need to have an error check on every line -- I haven't gotten to the implementation yet. :)

acxxel/acxxel.h
58

Is this true? Like Acxxel exposes a different interface from the standard CUDA runtime library, so maybe it's not a "drop-in" replacement. But maybe it's a "replacement", and maybe we should say it doesn't require libcuda.

105

Maybe we can say, "Function type used to destroy opaque handles given out by the Platform." or something.

115

It's not obvious to me what these three mean; maybe they should have comments?

120

You know, I thought I knew what this meant, but now I'm not sure.

127

Same for these three.

149

Maybe, "Gets a pointer to the platform with the given name" or something, so it's clear that there can be multiple platforms.

155

Can we expand on this a little bit? Maybe, "All operations on a Stream are serialized."

178

Nit, s/and/so/ or something.

194

Which function -- the callback, or addCallback()? (I am legitimately unsure.)

196

Why not just take std::function<void(Stream&, const Status&)> here?

That would give us better error messages, and it seems like it would simplify a lot of the code in the cc file. It's also a potential speedup -- for example, if Callback is a function pointer, creating an std::function from it doesn't necessarily allocate, whereas the implementation below always calls malloc.

Maybe I'm missing something.

204

Can we be more specific about the allowed types here? It looks like they need to be convertible to something or other. Same elsewhere where we take universal references to "memory things".

260

Here too it might be helpful to have a bit more verbiage here telling me what this is good for.

270

This strikes me kind of as a weird name; the stream "records" the event? But I don't have a better suggestion offhand, and if it's consistent with prior art, maybe it's fine.

280

If we're giving out floating-point numbers anyway, could we give seconds instead? That's a more natural unit.

292

Should this have a comment? I can't figure out what this is.

347

Maybe s/the index of//. It's pretty clear from the signature how the number corresponds to a device, whereas as written I'm kind of confused.

366

llvm/clang's style is to leave off braces here. Are we adopting a different style? I don't really care myself, but you'll probably get grumbles.

371

I have no idea what this is doing; can we expand upon this some?

396

You only care about the type of *DeviceSrc.data() and *DeviceDst.data(), right? Maybe we could just use decltype instead of ::value_type, so we'll be compatible with more types.

(If you don't want to type out the decltype expression over and over, I suppose we could even have our own traits class. But that may be overkill.)

479

Probably worth adding a comment indicating that if DeviceSrc's size is less than ElementCount, we copy only the first ElementCount elems -- I wasn't sure if that would be an error or what.

481

Is this better than requiring us to go through a span? That would shrink the API a bit, and (I would hope) make it more obvious what happens when ElementCount doesn't match DeviceSrc's size.

539

Similar comment here.

599

Do we actually care about operator[] as opposed to the data function? That's surprising if so... Perhaps we should explicitly write down our "concepts" somewhere.

At least let's add a comment to the header indicating what "concept" Container needs to fulfill.

611

Since the device one doesn't call T::T() but this one does, maybe we should say that.

614

Could we call this newAsyncHostMem? That would match registerHostMem, and also just be easier to understand, I think.

628

Hm, requiring the program to be stored in an std::string may be a problem. Could we accept a Span<char> instead?

644

I am unclear how this kernel-launching stuff relates to the comment at the top of file saying we don't do this... Is this for opencl only?

714

Not sure this is a good name, because it makes it sound like it's just calling malloc, when in fact it's also registering the memory.

758

Nit, it seems kind of inconsistent that we declare these out-of-line but the sync versions inline.

934

Nit, const is an adjective, so "of const elements" or something?

964

Remind me why we can't do these inside the class definition?

1012

Hm, this is two indirect calls, one to get the function pointer, and the other to call it.

We could make this just one virtual call, but that would be inconsistent with how you do literally everything else here. So I guess this is fine as-is. If it becomes a performance issue, we can address it then.

1030

This needs noexcept? That's...interesting. Maybe deserves a comment.

1036

Oh, this is the conversion that lets us pass it to a <<<>>> launch? But this pointer is not valid on the host? (Never? Or, it's dereferenceable if we are using unified memory?)

If so, or probably also if not, a comment would be helpful here.

1046

Interesting -- I presume you're doing this because that's what std::span does? Does std::span not have a way to slice that doesn't do a bounds check?

Anyway I think this is very likely fine for our use-case.

1078

Do we need "explicit" here? It makes a subtle difference, but unless you're going for it, maybe we just leave it off?

1095

Probably worth saying, "scope, but does not free itself."

1161
T *Memory = new (MaybeMemory.getValue()) T[ElementCount];

Oh, hm. Are you sure this is the right way to undo your allocation above? Like, we don't have to somehow delete the placement new'ed array itself? (I don't see a way to do it on cppreference.)

jhen updated this revision to Diff 75048.Oct 18 2016, 11:18 AM
jhen marked 37 inline comments as done.
jhen edited edge metadata.
  • Respond to jlebar's comments
  • Keep track of first error status per thread
jhen added a comment.Oct 18 2016, 11:19 AM

The one thing I really miss from the old SE is not checking the error at every line. I wonder if we could say that errors carry forward just like they used to? Or maybe they do actually carry forward and I don't need to have an error check on every line -- I haven't gotten to the implementation yet. :)

You were right that it did require checking the error at every line. To address this, I've added in a thread_local variable to keep track of the first error status. With this, users can make as many calls as they want without checking the error, then they can do a final check that nothing went wrong.

What do you think?

acxxel/acxxel.h
58

I changed it to "modern replacement". Unfortunately it does require libcuda and it also requires libcudart right now to access pointers for device variables. We can maybe get rid of the libcudart dependency for clang in the future by including a real drop-in replacement for libcudart, but it's not at that point now.

115

I removed these and the config structs below because they were too CUDA-specific, and I don't think they were very helpful anyway.

194

Now explicitly says addCallback to clarify.

270

It's CUDA's name for the operation, but I don't like it either. It's misleading, so I'm renaming it to "enqueue".

396

Actually, I only want DeviceSrcTy and DeviceDstTy to be either DeviceMemory<T> or DeviceMemorySpan<T>, neither of which has a data member function, so I think this is doing the right thing.

479

Function deleted, as suggested below.

481

We can get rid of this.

539

Deleted this function.

599

I changed this to checking data instead, and I added a comment to try to describe what is expected of Container.

It turns out I was using this for arrays as well as other containers. That doesn't work with data member function, so I added an overload for arrays as well.

644

I can't seem to find the comment at the top of the file that you mention here. Can you be a bit more specific about which comment you mean?

714

Changed the name to rawMallocRegisteredH.

758

I moved the sync copies out of line as well.

964

clang-tidy doesn't like it. clang-tidy wants them to be noexcept, but it doesn't accept the noexcept annotation if they are declared inline. I think you had some ideas about how this might make sense because default inline means that it might not be created, but default out of line was more strict.

1012

I don't think it's a big deal that it's a bit inconsistent with the other destructors because DeviceMemorySpan doesn't own its memory like the others. I switched it to using just a single virtual call.

1030

I think that got put there by accident. It's gone now.

1046

Yes, I just copied the span API here. I didn't see any way for it to go without bounds checking.

1078

It was just a leftover from the development process. It is gone now.

1161

My understanding is that the destructor of the ThePointer member will be called when this function ends. The ThePointer member is a std::unique_ptr with a custom deleter, and the custom deleter will call out to the platform to "free" the memory owned by ThePointer. So I don't think there is any memory leak here.

You were right that it did require checking the error at every line. To address this, I've added in a thread_local variable to keep track of the first error status. With this, users can make as many calls as they want without checking the error, then they can do a final check that nothing went wrong.

Should it be thread_local, or local to the stream itself? (Are streams even thread-safe? I would have assumed not, but if they are, we should comment that.)

acxxel/acxxel.h
115

So how do we access those configuration knobs now? Or, is that functionality just not available to users?

296

What is the purpose of having separate program and kernel classes? It seems like a kernel is just a program plus a name, so could we just merge the two? If not, maybe we can briefly explain why they are separate concepts.

463

Perhaps it would be consistent per elsewhere to take a Span? I don't really care either way, though; if you like it like this, that's fine.

644

Sorry, I think I was thinking of the commit message.

Question stands, though, wrt exactly what we're saying our library does. We can launch kernels on both CUDA and OpenCL, but only through this limited interface?

1161

Definitely no memory leak. The question is more whether C++ requires us to do an "array placement delete" to match the "array placement new", or whether it's OK to in-place destroy the elements piecewise like we do here.

Judging from the random people on the Internet here, it seems like using array placement new for this is dicey and may actually be a buffer overflow. Maybe we should just construct the elements one at a time? Then this destruction code is clearly correct.

http://stackoverflow.com/questions/8720425/array-placement-new-requires-unspecified-overhead-in-the-buffer

jhen marked 2 inline comments as done.Oct 18 2016, 12:45 PM

You were right that it did require checking the error at every line. To address this, I've added in a thread_local variable to keep track of the first error status. With this, users can make as many calls as they want without checking the error, then they can do a final check that nothing went wrong.

Should it be thread_local, or local to the stream itself? (Are streams even thread-safe? I would have assumed not, but if they are, we should comment that.)

Streams are thread safe. This is something I carried over from the original StreamExecutor. I added a comment to the Stream, Platform, Event, Program, and Kernel classes indicating that they are all thread-safe.

I could make the status local to a Stream, but none of the synchronous operations are launched via Streams, they are launched via Platforms instead. So, I could make a status that is local to a Platform, and another one that is local to a Stream. I feel like that could get a bit confusing for users.

Or, perhaps it would be less confusing to have one Status for the whole library. That actually seems pretty nice. What do you think?

acxxel/acxxel.h
115

The functionality is no longer available to users. We can add support back in later if needed.

296

The important difference is that a program can contain multiple kernels because the input source for the program can define multiple kernels. I added a comment to the Program class to explain this.

I don't think we want to merge Program with Kernel because, while it's true that a Kernel is basically just a program plus a name, it's not true that a Program is just a Kernel minus a name.

644

We can launch kernels on both CUDA and OpenCL through this type-unsafe interface.

Additionally, for CUDA, we can pass in our Stream and DeviceMemory types to triple-chevron kernel launches as compiled by clang or nvcc.

1161

Oh I see! Yes, that is scary. OK, I switched to calling placement new for each element individually rather than dealing with placement new for an array.

jhen updated this revision to Diff 75055.Oct 18 2016, 12:45 PM
jhen marked an inline comment as done.
  • Respond to jlebar's comments 2
jhen added a subscriber: eliben.Oct 18 2016, 1:33 PM
jhen updated this revision to Diff 75084.Oct 18 2016, 3:59 PM
  • New error handling in stream
  • Reorganize kernel launch code
  • Move enqueueEvent to Stream
jhen added a comment.Oct 18 2016, 4:04 PM

In my latest patch I responded to jlebar's comments about error handling. The new model in this patch is to have each Stream own its own error state, as was done in StreamExecutor. There is now a function to query the state of the Stream, and all the enqueuing functions that used to return Status now return Stream& instead. This means the fluent Stream launching interface is back as it was in StreamExecutor. Maybe we'll keep the name StreamExecutor for this new thing instead of calling it Acxxel in the end, but we'll keep it as Acxxel for now, at least to distinguish it from the old StreamExecutor code.

This new patch also moves some methods to places where they belong. Kernel launching and event enqueuing are now methods of Stream. The createKernel function is now a method of Program.

jhen updated this object.Oct 18 2016, 4:09 PM
jhen updated this object.Oct 18 2016, 4:25 PM

Comments on cc file for header.

acxxel/acxxel.cpp
33

Nit, Can we comment that this is case-insensitive?

37

Using a unique_ptr here means that when the program shuts down, it will run the Platform's destructor.

This is usually something we avoid in large projects because these destructors are expensive and usually not actually useful.

If we wanted to avoid running the destructor here, we could just make this into an Expected<Platform*> and unwrap the unique_ptr. I am not sure if you want that, though -- there may be actual useful work you want to do on shutdown.

79

std::move(Callback)

acxxel/acxxel.h
271

Let's expand a bit on this comment, specifically that these block the host until all previously-enqueued work on the stream plus the copy completes.

281

It might be worth calling these "syncCopyFoo" so there's no confusion about what they do.

296

Oh, I see! The comments clarify it for me, thanks.

jhen updated this revision to Diff 75100.Oct 18 2016, 6:05 PM
jhen marked 5 inline comments as done.
  • Remove unused Platform::getContext function
  • Documentation fixes
  • Respond to jlebar's comments on acxxel.cpp
jhen added a comment.Oct 18 2016, 6:07 PM

Latest patch responds to jlebar's comments on acxxel.cpp and does a couple of other things.

  • Removes old Platform::getContext function. It used to be used for launching OpenCL kernels, but is not needed now.
  • Cleans up a bunch of minor documentation stuff.
acxxel/acxxel.cpp
37

No need to run the cleanup. I changed it to Expected<Platform *>.

jhen updated this revision to Diff 75101.Oct 18 2016, 6:20 PM
  • Remove fixed_vector.h
jlebar added inline comments.Oct 18 2016, 6:29 PM
acxxel/cuda_acxxel.cpp
30

Does this have to be per-thread rather than per stream? Kind of a bummer that one thread can't operate on two streams for different devices (among other things).

At least maybe setActiveDevice should be called setActiveDeviceForThread.

170

It looks like this map always has the keys 0, 1, ..., n. So, could we make it a vector? That will make lookups substantially cheaper, although I don't know if we care.

194

How does this work, exactly? We set this thread's active device index to 0, but new threads have it as uninitialized memory? That seems not so good.

284

Forgot to convert to seconds.

477

std::move(Callback)

498

Could we use std::array so we don't have to do these ARRAYSIZE shenanigans?

jhen updated this revision to Diff 75163.Oct 19 2016, 9:56 AM
jhen marked 4 inline comments as done.
  • Respond to jlebar's comments on cuda_acxxel.cpp
jhen added inline comments.Oct 19 2016, 10:00 AM
acxxel/cuda_acxxel.cpp
30

TL;DR the CUDA driver library forces us to do it per thread or to suffer a performance penalty.

I chose it to be per thread for efficiency reasons. The old StreamExecutor tried to do it per stream, and this led to extra library calls for every operation to check that the current device was equal to the one expected by the Stream. Ideally we would just store a "context" in the stream, the context would know which device it belonged to, and the context would be used to launch all work, but unfortunately that is not how the CUDA driver library works. Instead, the CUDA driver library works a lot like this where it allows you to set the active context, but then you don't pass the context to each method call, instead, it just uses the active context.

It actually doesn't prevent a thread from using streams from different devices. It just forces that thread to do the "active device" bookkeeping on its own. Want to use stream0? First set the active device to device0. Want to go back to stream1? Remember to set the active device back to device1.

I did change the name as you suggested, though.

194

Good point. I want every thread to start with active device set to zero because that's the way the CUDA runtime does it. I moved this initialization to the declaration of ActiveDeviceIndex. Based on some experiments and my interpretation of the standard, that should do the right thing.

jhen updated this revision to Diff 75167.EditedOct 19 2016, 10:19 AM
  • Fix deleted Span container constructor

I had forgotten to fix a detail of the Span class implementation before posting this patch. This patch cleans up that detail.

This all looks really good to me.

acxxel/cuda_acxxel.cpp
30

I chose it to be per thread for efficiency reasons.

OK, I am totally onboard with that -- zero-cost abstractions.

But maybe we can have assert()s that ensure that users are always switching to a stream's device before touching that stream? If we have to have an ifndef NDEBUG variable or two inside the stream class to support this, that's fine by me.

We also need to have comments explaining this weird requirement.

acxxel/examples/opencl_example.cpp
56

Nit, we can get away with single braces here, which I think is more natural.

59

If we're using assert(), we should #error if we're built with NDEBUG, or alternatively #undef NDEBUG as the first thing in this file (before any #includes).

acxxel/opencl_acxxel.cpp
24

I wonder if we need this extra namespace (same for cuda). We don't actually expose anything other than the make-platform function in it. Do we plan on exposing anything else?

39

Do you want this to be explicit?

45

Same for CUDA: Using TLS like this means that it doesn't ever make sense for us to have two platform objects going. Which is probably correct for other reasons, too.

So maybe instead of having createFoo functions, we should have getFoo functions that return a singleton (created lazily, using static Foo* = new Foo).

I know this is basically what the current string-to-platform function is trying to do, but I guess the suggestion is to push the singleton-ness down into the lower-level factory functions, since singletonness is necessary for correctness.

73

This could live inside the anon ns? That's a potential speedup, too, since it lets the compiler make assumptions about this class that it can't otherwise.

Same for CUDA, if applicable (I didn't check).

176

s/100/MaxNumEntries/?

214

It's a nit, but I'm not sure we need this first if statement?

535

Did we decide to make streams single-threaded to avoid this overhead?

If so, in addition to getting rid of the mutex, we should get rid of the comment about concurrently enqueueing an event, since it's not valid to touch the stream concurrently at all.

acxxel/span.h
10

I really don't like these top-level comments that say nothing you couldn't have guessed from the filename, even though we have them all over LLVM.

Does having this really help doxygen? If so, would it make doxygen look awful if we moved the comment currently on the span class into this block, so that it actually provided some value?

26

Can we make this constexpr?

59

I am really surprised that this is what they suggest in the proposal. It's fine for now, if it becomes a problem we can turn it into an assert().

94

Nit, could we just use data() in the decltype? We're going to use it in the constructor below anyway. Or is this as specified and the spec is just weird?

195

I cannot figure out the reason that some of these functions are noexcept but not others.

acxxel/status.h
22
34

s/an/a/, and below.

57

Can we not =default these four?

82

Since we're going to take ownership of the status, we should take it by value, rather than const ref (and then std::move it into TheError).

This will make it work efficiently if we're passed an rvalue.

94

We can do the same thing with Value -- just take it by value and std::move it.

154

dblaikie taught me today that we could also take Expected by value in our operator=. This would let us have
just one overload instead of two, kind of nice.

acxxel/tests/acxxel_test.cpp
359

Can we use proper synchronization operations, like a condvar? volatile doesn't necessarily do what we want on all platforms -- just sort of happens to work on x86.

362

Wait...we have EXPECT*? Can we use this in the tests where we're currently using assert()s?

acxxel/tests/span_test.cpp
24

Nit, maybe we should call this arraySize, since it only works on arrays.

37

Nit, not sure why we have a variable for this. If you just want to make it clear what the second arg to the Span constructor means, you could write

Span(nullptr, /* size = */ 0);

I believe we even have a clang-tidy check to check comments in this style.

79

Nit, should be consistent wrt we have an explicit length in the array or not. I kind of prefer not. (This appears in other files as well.)

86

Nit, prefer constructing std::arrays with = {1, 2, 3}, as it's more natural. Also, double-braces have this truly awful footgun, which has permanently scarred those of us who have rediscovered it ourselves: https://youtu.be/rNNnPrMHsAA?t=11m15s

jhen updated this revision to Diff 75659.Oct 24 2016, 4:47 PM
jhen marked 21 inline comments as done.
  • Remove asserts in OpenCL example
  • Respond to jlebar's OpenCL, util comments
jhen added a comment.Oct 24 2016, 4:47 PM

In addition to responding to jlebar's posted comments, I also removed the acxxel::getPlatform function and replaced it with the two functions acxxel::getCUDAPlatform and acxxel::getOpenCLPlatform. I also added a comment to explain that the CUDA and OpenCL platforms are available out of the box with Acxxel, but that other platforms can be created as well. The old acxxel::getPlatform function made it confusing to think about how to add a new platform because it seemed like the new platform should also be registered somehow to put in on equal footing with CUDA and OpenCL. I hope this new design will be clearer in this aspect.

acxxel/cuda_acxxel.cpp
30

I don't think that assert() calls are useful to check this because the platform libraries themselves (CUDA, OpenCL, etc.) will emit errors for these cases.

I added comments to Stream to explain its associated device index and caution about using it when the active device index doesn't match, and I added a function to query the device index for a Stream.

acxxel/examples/opencl_example.cpp
56

I blame -Wall -Wextra for this. I changed it to -Wall -Wextra -Wno-missing-braces to get rid of the warning for this case.

59

I changed assert to "check and exit" to prevent having to deal with this.

acxxel/opencl_acxxel.cpp
24

I mostly got rid of these namespaces, but, as part of another design change, I moved the platform creation (now getter) functions into them.

39

Yes, I just forgot to mark it. Thanks for catching that.

535

Thanks for catching this. I got rid of the mutex and got rid of the whole CLEvent struct, since it is now just a cl_event*.

I also got rid of some comments for addCallback that talked about accessing the same stream from multiple threads, but I didn't see a comment that talked about Event objects being enqueued on streams concurrently. Please let me know if I didn't get rid of a comment I should have.

acxxel/span.h
10

I don't like them either. They don't help with Doxygen as far as I know. I just put them in because I thought they were required for LLVM style. I've removed most of them now, but I did leave a few that I thought were actually serving a purpose.

59

Sounds good. Like you say, it should be easy to change.

94

I did this to try to make sure that the container supported operator[] (the proposed spec suggested I should check for this). Do you think there is a cleaner way to do that?

195

I don't understand it either. In the class comment I added a link the spec document I used. That's where I got all these signatures.

acxxel/status.h
22

Unfortunately gcc only seems to support this as a function attribute (and that seems consistent with the documentation you cited). When I tried to use it as a class attribute, I got a compilation error.

34

Oops. Status used to be called Error.

57

I learned something sad while trying to do this. Status has a std::string field, and it seems that the move assignment operator for std::string is not marked noexcept in my standard library. This means that the default move assignment operator for Status is not noexcept and I get a compile error if I try to declare it noexcept and default.

To deal with this, I chose to retain the noexcept and to re-implement the default move assignment operator. I was able to default the other 3 with my standard library, so I did that.

94

I think I understood you correctly here. I collapsed the const T &Value and T &&Value constructors into one constructor taking the argument by value and moving it into TheValue. Let me know if I was only supposed to replace one of these constructors that way.

acxxel/tests/acxxel_test.cpp
362

I couldn't find the places I am using assert() in tests. I think you might be referring to the stuff in the examples directory. That stuff doesn't depend on gtest, so it uses assert() instead.

acxxel/tests/span_test.cpp
37

This is a fun one. If I pass in a literal 0 it tells me the call is ambiguous because Span<int> also has a constructor with the signature Span<int>(int *FirstElem, int *LastElem). It seems that C++ is helpfully willing to convert a literal 0 to an int * (I'm sure it has to do with NULL and C compatibility).

I still thought it was a useful test, though, because the size argument could be stored in a variable as I have done in my test.

79

I think I got all the places where I was doing this.

I am happy with this except for my concern about the status handling on sync.

Would it be worth updating (some of?) the examples to the fluent interface?

acxxel/acxxel.cpp
26

These namespaces seem redundant with the function names? What are you trying to achieve by keeping these out of the root acxxel namespace? (I suspect you have a good reason...)

acxxel/acxxel.h
228

"platform's active device index for the thread" is awkward. This was hard for me to rephrase, but I think maybe it's better if we move the notion that it's platform-specific to the front of the paragraph, so we don't have to repeat it. Something like

Each Platform has a notion of the currently active device on a particular thread (see Platform::getActiveDeviceForThread and Platform::setActiveDeviceForThread).  Each Stream is associated with a specific, fixed device, set to the current thread's active device when we create the Stream.  Whenever a thread enqueues commands onto a Stream, its active device must match the Stream's device.
247

Upon reconsideration, I am not entirely sold on how we're handling errors.

If you use the fluent API, you might reasonably write code like:

Status s = stream.do_something().copy_async().sync();
if (!s.ok()) return s;

But this is subtly wrong. It will usually do what you want (return errors that occurred in any of the three operations), but if we happen to get context-switched in between the first two operations, an error from the first one may be reported when we launch the second one. In which case the sync may return OK.

(You can write effectively the same code with the non-fluent API as well.)

This problem is exacerbated by the fact that Status is now a warn-if-unused type, so if someone writes the above code and *doesn't* look at the return value, we'll warn them to!

We could fix this by making sync() return void. Alternatively, we could make sync() implicitly do a takeStatus() (we'd want to comment this). I kind of like the latter, because, presuming that you compile with clang and synchronize your streams, this makes it hard to forget error checking. We could add the warn_unused attribute to this function and then get warnings with gcc, too. (You could do that in a separate patch if you wanted.)

What do you think?

acxxel/cuda_acxxel.cpp
537

Storing it as a value type here will cause the destructor to run on program shutdown. :)

Now that I see this I like even more that we pushed the encapsulation down to these functions, because it gives you the flexibility to run code on program shutdown, or not, depending on what your platform needs.

acxxel/examples/opencl_example.cpp
59

Oh, I didn't realize that this wasn't a test, duh. Sorry about that.

As you have it is fine, or asserts are also fine, since it's just an example.

acxxel/opencl_acxxel.cpp
535

You got the comments I was after; we're good.

acxxel/span.h
94

If the spec says to check for operator[], I'm down with this as-is.

acxxel/status.cpp
17 ↗(On Diff #75659)

As far as I can tell from cppreference, it's just not noexcept period.

Any reason you don't want these in the header file?

acxxel/status.h
22

Aha, I see. That's cool, I didn't realize that clang let you warn_unused_result on a class.

Maybe this macro should have a slightly different name, to address my point of confusion. Then maybe we can have a comment to the effect of your response above?

57

OK, I think that's safe because std::string's move constructor doesn't throw, as I read http://en.cppreference.com/w/cpp/string/basic_string/basic_string.

94

Yep. It's slightly less efficient because we move value twice instead of just once, but I seriously doubt it's going to matter.

105

Unnecessary now that we're taking That by value.

acxxel/tests/acxxel_test.cpp
353

Because we're now reading and writing it from inside a mutex, we don't need GoFlag to be an atomic. (I mean, I suppose it doesn't hurt...)

MarkerFlag is read outside a mutex, so it still needs to be an atomic, or otherwise we could just guard everything with the one mutex -- that seems like the simplest approach to me.

acxxel/tests/span_test.cpp
37

Hahaha, okay then. :)

jhen updated this revision to Diff 75737.Oct 25 2016, 11:04 AM
jhen marked 6 inline comments as done.
  • Respond to jlebar's comments 2016-10-24
jhen added a comment.Oct 25 2016, 11:04 AM

Would it be worth updating (some of?) the examples to the fluent interface?

Good idea, I updated both of them.

acxxel/acxxel.cpp
26

Maybe not a good reason, but the reason I did it was because I also defined functions with the same name below. The functions below are wrappers for these functions, and the wrappers are responsible for returning an error if the library was compiled without support for the given platform.

To try to remove the redundancy, I changed both of these function names to just getPlatform and left the namespaces in place to distinguish which platform they are for. I'm happy to rename these functions to anything else, but I just want to preserve the public names getCUDAPlatform and getOpenCLPlatform for the wrappers.

acxxel/acxxel.h
228

Thanks! As you can tell, I was having a tough time explaining things there.

247

sync() is already doing an implicit takeStatus(), as you suggested. I didn't realize it wasn't mentioned in the comments. I've fixed the comments now.

I also added warn_if_unused the Stream functions returning a Status so that gcc can produce nice warnings too.

acxxel/cuda_acxxel.cpp
537

Now storing it as a raw pointer instead. Same change for OpenCL.

acxxel/status.cpp
17 ↗(On Diff #75659)

Changed the comment simply to say that the move assignment operator for std::string is noexcept.

No good reason not to have these in the header. I moved them there.

acxxel/status.h
22

I changed the name to ACXXEL_WARN_UNUSED_RESULT_TYPE and added a comment explaining why it only works with clang.

jlebar accepted this revision.Oct 25 2016, 11:42 AM
jlebar edited edge metadata.

\o/

acxxel/cuda_acxxel.cpp
547

Can we just return MaybePlatform now, or does that not work because the types don't match? If so, maybe we should actually have a constructor on Expected which does this type conversion. (unique_ptr has something similar.) But we could do that in a separate patch.

This revision is now accepted and ready to land.Oct 25 2016, 11:42 AM
jhen updated this revision to Diff 75774.Oct 25 2016, 1:02 PM
jhen edited edge metadata.
  • Add ctors for Expected(Expected<U>)
jhen added inline comments.Oct 25 2016, 1:05 PM
acxxel/cuda_acxxel.cpp
547

I added the templated copy constructor to make this work (and the corresponding templated move constructor while I was at it), so now we are just returning MaybePlatform.

One thing I wasn't sure about when adding the templated constructors was whether I should remove the corresponding non-templated constructors because they now seem redundant. I didn't remove them, though, because std::unique_ptr seems to keep both the templated and non-templated version.

jlebar added inline comments.Oct 25 2016, 1:14 PM
acxxel/cuda_acxxel.cpp
547

Yes, I think it probably makes sense to keep the non-templated constructors, although I am not 100% sure if they're necessary, like you say.

This revision was automatically updated to reflect the committed changes.