The Technical Reference Manuals for these two CPUs state that branching to an unaligned 32-bit instruction incurs an extra pipeline reload penalty. That's bad.
I also enable the optimization at -Os for just these two CPUs. My impression has been that it's a bit of a gamble with the bigger cores, and it also wastes more space. But for these two we're getting 1 cycle per iteration in return for 1 byte per loop (on average); that seems like it definitely fits into LLVM's quirky definition of -Os.
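For reference, here is a rough sketch of where the "1 byte per loop (on average)" figure comes from. It assumes a 4-byte alignment target, that Thumb instructions are 2-byte aligned, and that a loop header is equally likely to land on either offset modulo 4. The helper name `padding_to_align` is made up for illustration and is not part of the patch.

```python
def padding_to_align(address: int, alignment: int = 4) -> int:
    """Bytes of padding needed to round `address` up to a multiple of `alignment`."""
    return (-address) % alignment


# Assumption: Thumb instructions are 2-byte aligned, so a loop header
# can only start at offset 0 or 2 modulo 4.
possible_offsets = [0, 2]

# Padding needed in each case to reach a 4-byte boundary.
paddings = [padding_to_align(offset) for offset in possible_offsets]

# Assumption: both offsets are equally likely, so take the plain mean.
average_padding = sum(paddings) / len(paddings)

print(paddings, average_padding)
```

Under those assumptions the padding is either 0 or 2 bytes, which averages out to 1 byte per aligned loop.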
I'm open to extending it to other processors, but my research indicates that Cortex-M0 is too simple to benefit (its documentation says conditional branches always take 3 cycles if taken), and Cortex-M7 has no performance documentation.
This seems like an odd change to make at -Os. It, by definition, increases code size.
As you said, this might be an LLVM quirk though. Do you have a link to LLVM's definition of -Os?