ARM: expand atomic ldrex/strex loops in IR
The previous code, which expanded ATOMIC_LOAD_WHATEVER nodes at
MachineInstr emission time, had grown extremely large and involved
in order to account for the subtly different code needed for the
various flavours (8/16/32/64 bit, cmpxchg/add/minmax).
Moving this transformation into the IR clears up the code
substantially, and makes future optimisations much easier:
- An atomicrmw followed by a use of the *new* value can be made more efficient. Once the expansion is in IR, simple CSE could reuse the value already computed inside the loop.
- Making use of cmpxchg success/failure orderings only has to be done in one (simpler) place.
- The common "cmpxchg; did we store?" idiom can be exposed to optimisation.
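As a rough source-level illustration of the first and third points (not
the pass itself), the sketch below writes an atomic add as an explicit
retry loop using std::atomic. The helper names fetch_add_loop,
add_and_get and store_if_equal are hypothetical, and
compare_exchange_weak only stands in for the ldrex/strex pair, assuming
its spurious failures model a failed strex.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: an atomic add written as an explicit retry loop,
// the same overall shape as a load-exclusive/store-exclusive loop.
inline std::uint32_t fetch_add_loop(std::atomic<std::uint32_t> &cell,
                                    std::uint32_t operand) {
    std::uint32_t old_val = cell.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously (much like a strex);
    // on failure it refreshes old_val with the current contents.
    while (!cell.compare_exchange_weak(old_val, old_val + operand,
                                       std::memory_order_seq_cst,
                                       std::memory_order_relaxed)) {
    }
    return old_val; // atomicrmw semantics: the *old* value is returned
}

// Using the *new* value: old + operand was already computed inside the
// loop, so a compiler that sees the loop can reuse it rather than
// recomputing or reloading.
inline std::uint32_t add_and_get(std::atomic<std::uint32_t> &cell,
                                 std::uint32_t operand) {
    return fetch_add_loop(cell, operand) + operand;
}

// The "cmpxchg; did we store?" idiom: the caller only cares whether the
// store happened, not which value was observed.
inline bool store_if_equal(std::atomic<std::uint32_t> &cell,
                           std::uint32_t expected,
                           std::uint32_t desired) {
    return cell.compare_exchange_strong(expected, desired,
                                        std::memory_order_seq_cst,
                                        std::memory_order_relaxed);
}
```

With the loop visible to the optimiser, the addition in add_and_get and
the comparison behind store_if_equal are ordinary values rather than
details hidden inside an opaque pseudo-instruction.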
I intend to gradually improve this situation within the ARM backend
and make sure there are no hidden issues before moving the code out
into CodeGen to be shared with other targets (at least ARM64/AArch64,
though I think PPC & Mips could benefit too).