Performing a load before calling __cxa_guard_acquire is supposed to be an optimization, but it isn't much of one if we're just going to emit a call to __atomic_load_1 to do it. So just skip the load, and let __cxa_guard_acquire do whatever it wants.
(In practice, on such targets, the C++ library is just built with threading turned off, so the result isn't actually threadsafe, but there's not really anything clang can do about that.)
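For illustration, here is a rough hand-written C++ sketch of the two shapes of guarded initialization discussed above. It is not what clang emits verbatim. It assumes a hosted GCC or Clang toolchain with the generic Itanium C++ ABI guard layout (first byte of the guard nonzero once initialization is complete), and it omits the __cxa_guard_abort path used when an initializer throws. The names compute, get_with_fast_path, get_without_fast_path, self_check, and the guard/value/counter variables are made up for the example.

```cpp
#include <cassert>
#include <cxxabi.h>  // abi::__guard, abi::__cxa_guard_acquire, abi::__cxa_guard_release

// Hypothetical stand-in for the static variable's initializer.
static int compute() { return 42; }

// Guard, storage, and init counter for each variant (illustrative names).
static abi::__guard guard_fast;
static int value_fast;
static int init_count_fast;

static abi::__guard guard_slow;
static int value_slow;
static int init_count_slow;

// Variant 1: speculative byte load before entering the runtime.
// On a target without native atomics, this acquire load is the part that
// would turn into a library call to __atomic_load_1.
int &get_with_fast_path() {
  auto *first_byte = reinterpret_cast<unsigned char *>(&guard_fast);
  if (__atomic_load_n(first_byte, __ATOMIC_ACQUIRE) == 0) {
    // Returns nonzero only for the caller that must run the initializer.
    if (abi::__cxa_guard_acquire(&guard_fast)) {
      value_fast = compute();
      ++init_count_fast;
      abi::__cxa_guard_release(&guard_fast);  // marks the guard as initialized
    }
  }
  return value_fast;
}

// Variant 2: no speculative load; always ask __cxa_guard_acquire, which
// returns 0 once initialization has already completed.
int &get_without_fast_path() {
  if (abi::__cxa_guard_acquire(&guard_slow)) {
    value_slow = compute();
    ++init_count_slow;
    abi::__cxa_guard_release(&guard_slow);
  }
  return value_slow;
}

// Small self-check: each variant runs its initializer exactly once.
void self_check() {
  get_with_fast_path();
  get_with_fast_path();
  get_without_fast_path();
  get_without_fast_path();
  assert(init_count_fast == 1 && init_count_slow == 1);
}
```

Both variants produce the same result; the second just pays for a call into the C++ runtime on every access instead of a call to the atomics library followed (the first time) by a call into the C++ runtime.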
The alternative here is that we try to define some ABI for threadsafe init that allows the speculative load without full atomics. Almost any target without full atomics has a load that's "atomic enough" for this purpose. But it's not clear how we would emit an "atomic enough" load in LLVM IR, and there isn't any ABI document we can refer to.
Or I guess we could turn off -fthreadsafe-statics by default on Cortex-M0, but that seems like it would be surprising.