From v8.4a onwards, aligned 128-bit ldp and stp instructions are guaranteed to be single-copy atomic, so they are going to be a lot more efficient than the CAS loop we used to implement "load atomic" and "store atomic" before even if we do need a DMB sometimes. Additionally, some earlier CPUs had this property anyway so it makes sense to use it.
Finally, even before this guarantee there are machine-specific circumstances where a 128-bit ldp/stp makes sense but doesn't really fit an atomic profile. So this extends the selection to volatile accesses, even if they're not aligned (presumably the coder knows what they're doing). The one exception for volatile is when -mstrict-align is in force, that should take precedence.