After experimenting with CUDA, I realized that we really only need to
set the active context right before creating an object such as a stream
or a device memory allocation. When we go on to use these objects later,
it is fine if the context that created them is no longer active,
operations with those objects will succeed anyway.
Since it turns out that we don't have to check the active context for
every operation, it makes sense to hide this active context from users
(by removing the "ActiveDeviceForThread" setter and getter) and to
change the Acxxel API to explicitly pass in the device ID to create
objects.
This change improves the Acxxel API and greatly simplifies the CUDA and
OpenCL implementations because they no longer require thread_local data.
Is there a reason we don't default DeviceIndex here as we do elsewhere, or is that just an oversight?