The Technical Reference Manuals for these two CPUs state that branching to an unaligned 32-bit instruction incurs an extra pipeline reload penalty. That's bad.
I also enable the optimization at -Os for just these two CPUs. My impression has been that it's a bit of a gamble with the bigger cores, and it also wastes more space. But for these two we're getting 1 cycle per iteration in return for 1 byte per loop (on average); that seems like it definitely fits into LLVM's quirky definition of -Os.
I'm open to extending it to other processors, but my research indicates Cortex-M0 is too simple to benefit (it claims conditional branches are always 3 cycles if taken), and Cortex-M7 has no performance documentation.