This seems strictly more likely to be correct and is also probably more
efficient on some platforms. I think I've gotten the memory orderings
correct, but a second pair of eyes is very welcome here, notably to make
sure that in all cases the synchronizes-with relationship on Done works
correctly, so that it is not possible to release waiting threads before
the effects of calling the function are visible to all threads.
If this looks good, and it sticks, and ...., I'll be able to remove
llvm::cas_flag entirely.
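
For reviewers checking the orderings, here is a minimal sketch of the general pattern under discussion. It is not the code in this patch: OnceFlag, callOnceSketch, demoCallOnce, and the Uninitialized and Wait states are hypothetical names chosen for illustration, and the spin-with-yield wait loop is an assumption. The point it tries to show is that the release store of Done pairs with an acquire load of Done in every thread that returns without running the function.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical state names; only Done is mentioned in the description above.
enum InitStatus { Uninitialized = 0, Wait = 1, Done = 2 };

// Hypothetical flag type, standing in for whatever the patch actually uses.
struct OnceFlag {
  std::atomic<int> status{Uninitialized};
};

template <typename Fn> void callOnceSketch(OnceFlag &flag, Fn &&fn) {
  // Fast path: an acquire load that observes Done synchronizes with the
  // release store below, so the effects of fn() are visible to this thread.
  if (flag.status.load(std::memory_order_acquire) == Done)
    return;

  int expected = Uninitialized;
  // Only one thread can move Uninitialized -> Wait. Relaxed is enough here
  // because the winner publishes its work via the release store of Done.
  if (flag.status.compare_exchange_strong(expected, Wait,
                                          std::memory_order_relaxed,
                                          std::memory_order_relaxed)) {
    std::forward<Fn>(fn)();
    // Publish: everything fn() wrote happens-before any acquire load
    // that reads Done.
    flag.status.store(Done, std::memory_order_release);
    return;
  }

  // Lost the race: wait until the winner publishes Done. The acquire load
  // is what keeps this thread from being released too early.
  while (flag.status.load(std::memory_order_acquire) != Done)
    std::this_thread::yield();
}

// Small driver: eight threads race, the function body runs exactly once,
// and every thread sees its effect after callOnceSketch returns.
inline int demoCallOnce() {
  OnceFlag flag;
  int counter = 0;
  std::vector<std::thread> threads;
  for (int i = 0; i < 8; ++i)
    threads.emplace_back([&flag, &counter] {
      callOnceSketch(flag, [&counter] { ++counter; });
      assert(counter == 1); // visible thanks to the acquire on Done
    });
  for (auto &t : threads)
    t.join();
  return counter;
}
```

If the store of Done were relaxed, or the waiting loop used a relaxed load, a thread could observe Done and proceed without any guarantee of seeing the writes made by the function, which is exactly the failure mode worth double-checking in the real change.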