When not optimising for code size, the extra ADD that can be inserted to form the base of an LDM adds an extra cycle of latency. Just using LDRs, which are usually pipelined, means fewer total cycles. On Thumb1 CPUs the loads are not pipelined, so forming the LDM is still profitable there.
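The trade-off described above can be sketched as a small predicate. This is only an illustration under stated assumptions, not the code in the patch: `LoadMergeInfo`, `shouldMergeIntoLDM`, and all of the fields are hypothetical names, and the sketch assumes that an LDM needing no extra ADD is always worth forming.

```cpp
// Hypothetical sketch of the heuristic; none of these names come from LLVM.
struct LoadMergeInfo {
  bool NeedsBaseAdd; // an extra ADD is required to materialise the LDM base
  bool OptForSize;   // the function is being optimised for code size
  bool IsThumb1;     // Thumb1-only core, where loads are not pipelined
};

// Returns true if merging the loads into an LDM is expected to pay off.
bool shouldMergeIntoLDM(const LoadMergeInfo &Info) {
  if (!Info.NeedsBaseAdd)
    return true; // assumption: without the ADD there is no added latency
  if (Info.OptForSize)
    return true; // size matters more than the extra cycle
  if (Info.IsThumb1)
    return true; // LDRs are not pipelined here, so the LDM still wins
  return false;  // otherwise keep the separate, pipelined LDRs
}
```

Under these assumptions, only the case "ADD needed, not optimising for size, not Thumb1" keeps the individual LDRs.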
This started out as "just turn off the load/store optimiser", but has become a little more refined since then. The test case is new; I'm just showing the diff here for clarity.
I think the check should be optForMinSize. Clang interprets -Os in a more performance-oriented way than GCC; something like "don't needlessly bloat code". -Oz is the real option to squash everything as much as possible.
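To make the distinction concrete: Clang marks functions with the `optsize` attribute at -Os, and with both `optsize` and `minsize` at -Oz. The sketch below models that with plain structs. `FnAttrs`, `attrsFor`, `checkOptForSize`, and `checkOptForMinSize` are hypothetical stand-ins for illustration, not LLVM APIs.

```cpp
#include <string>

// Hypothetical model of the two size-related function attributes.
struct FnAttrs {
  bool OptSize; // models the optsize attribute
  bool MinSize; // models the minsize attribute
};

// Maps an optimisation flag to the attributes Clang would attach.
FnAttrs attrsFor(const std::string &Flag) {
  if (Flag == "-Oz")
    return {true, true};
  if (Flag == "-Os")
    return {true, false};
  return {false, false}; // e.g. -O2: no size attributes
}

// Looser check: true at both -Os and -Oz.
bool checkOptForSize(const FnAttrs &A) { return A.OptSize || A.MinSize; }

// Stricter check: true at -Oz only.
bool checkOptForMinSize(const FnAttrs &A) { return A.MinSize; }
```

If the stricter check gates the size-preferring path, functions built at -Os would still get the separate LDRs, and only -Oz would keep the LDM with its extra ADD.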