- Set MaxAtomicPromoteWidth and MaxAtomicInlineWidth to 128 in the front end. This results in LLVM IR atomic operations in the aligned cases and libcalls otherwise.
- Add some FE tests for smaller integer types as well.
- Enable AtomicExpandPass.
- Set MaxAtomicSizeInBitsSupported to 128.
- Fix RegCoalescer for these types of loops; the SystemZ backend already has a hack for i128 there.
Would it help to tweak this heuristic a bit? What if we use 4 or 2 instead of 3?