[ARM] Disable LDM with offset for thumb2 cortex-m cpus
Needs Review, Public

Authored by dmgreen on Mar 12 2019, 7:55 AM.

Details

Summary

When not optimising for code size, the extra ADD that may be inserted to compute the base of an LDM adds an extra cycle of latency. Using individual LDRs, which are usually pipelined, takes fewer total cycles. On Thumb1 CPUs the loads are not pipelined, so forming the LDM is still profitable there.

This started out as "just turn off the load store optimiser", but has become a little more refined since then. The test case is new; I'm just showing the diff here for clarity.
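The cycle argument in the summary can be sketched as a small cost model. This is only an illustration: the function names are hypothetical, the 1+N cost for LDM is the figure quoted later in this review, and the LDR costs (1+N when back-to-back loads overlap, 2 cycles each when they do not) are assumptions rather than measurements of any particular core.

```cpp
// Illustrative cycle model; every constant here is an assumption, not a measurement.

// LDM of N registers: 1 + N cycles (the figure quoted in this review).
constexpr unsigned ldmCycles(unsigned NumRegs) { return 1 + NumRegs; }

// An ADD to materialise the base costs one extra cycle before the LDM.
constexpr unsigned addThenLdmCycles(unsigned NumRegs) {
  return 1 + ldmCycles(NumRegs);
}

// Assumption: back-to-back LDRs overlap, so N loads take 1 + N cycles.
constexpr unsigned pipelinedLdrCycles(unsigned NumLoads) {
  return NumLoads == 0 ? 0 : 1 + NumLoads;
}

// Assumption: without pipelining each LDR costs 2 cycles.
constexpr unsigned unpipelinedLdrCycles(unsigned NumLoads) {
  return 2 * NumLoads;
}

// Four loads: 5 cycles as pipelined LDRs, 6 as ADD+LDM, 8 as unpipelined LDRs.
static_assert(pipelinedLdrCycles(4) < addThenLdmCycles(4),
              "pipelined LDRs beat ADD+LDM");
static_assert(addThenLdmCycles(4) < unpipelinedLdrCycles(4),
              "ADD+LDM beats unpipelined LDRs");
```

Under these assumptions the ADD is what tips the balance: without it, the LDM and the pipelined LDR sequence cost the same number of cycles.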

Event Timeline

dmgreen created this revision. Mar 12 2019, 7:55 AM
t.p.northover added inline comments. Mar 12 2019, 8:09 AM
llvm/lib/Target/ARM/ARMLoadStoreOptimizer.cpp
677

I think the check should be optForMinSize. Clang interprets -Os in a more performance-oriented way than GCC; something like "don't needlessly bloat code". -Oz is the real option to squash everything as much as possible.

dmgreen marked 2 inline comments as done. Mar 12 2019, 9:35 AM
dmgreen added inline comments.
llvm/lib/Target/ARM/ARMLoadStoreOptimizer.cpp
677

Sure. This does trade size for performance, in that you will get several LDRs rather than a single LDM (plus the ADD). Happy to change that though; the example in the test case ends up using an add.w, so I expect the size difference in many cases will not be very large.
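To make the size trade-off concrete, here is a rough byte-count sketch. It relies only on the fact that Thumb encodings are 16 or 32 bits wide; the function names are hypothetical, and whether a particular LDR or LDM can use the 16-bit encoding (which depends on the registers and offsets involved) is not modelled, so the caller supplies that as an input.

```cpp
// Thumb encodings are either 16-bit (narrow) or 32-bit (wide).
constexpr unsigned NarrowBytes = 2;
constexpr unsigned WideBytes = 4;

// add.w (always wide) followed by an LDM that is either narrow or wide.
constexpr unsigned addPlusLdmBytes(bool LdmIsNarrow) {
  return WideBytes + (LdmIsNarrow ? NarrowBytes : WideBytes);
}

// NumLoads separate LDRs, of which NumNarrow (clamped to NumLoads) are 16-bit.
constexpr unsigned ldrSequenceBytes(unsigned NumLoads, unsigned NumNarrow) {
  return (NumNarrow > NumLoads ? NumLoads : NumNarrow) * NarrowBytes +
         (NumNarrow > NumLoads ? 0 : NumLoads - NumNarrow) * WideBytes;
}
```

For four loads, an add.w plus a wide LDM comes to 8 bytes, which four narrow LDRs match exactly, while four wide LDRs take 16 bytes.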

dmgreen updated this revision to Diff 190282. Mar 12 2019, 9:36 AM
dmgreen marked an inline comment as done.

Now minsize
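The shape of the decision discussed so far could be sketched roughly as below. This is not the code from the patch: `LdmCandidate` and `shouldFormLdm` are hypothetical names invented for illustration, and the real check in ARMLoadStoreOptimizer.cpp may be structured quite differently.

```cpp
// Hypothetical description of one candidate merge; not an LLVM type.
struct LdmCandidate {
  bool IsThumb2;      // compiling for Thumb2
  bool IsMClass;      // targeting an M-profile core
  bool NeedsBaseAdd;  // merging would require an ADD to form the base
  bool OptForMinSize; // function is optimised for minimum size (-Oz)
};

// Returns true if the loads should still be merged into an LDM.
inline bool shouldFormLdm(const LdmCandidate &C) {
  if (C.OptForMinSize)
    return true; // at minsize, favour the smaller sequence
  if (!C.NeedsBaseAdd)
    return true; // no extra ADD, so no extra cycle
  // Only Thumb2 M-class cores are assumed to pipeline the separate LDRs.
  return !(C.IsThumb2 && C.IsMClass);
}
```

Keying the escape hatch on minsize rather than optsize follows the point made above that -Os in Clang is still fairly performance-oriented.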

Since we're talking about a Thumb2 core, can we form ldrd here?

It may be possible to create ldrds, but I don't think that they will be any quicker. An ldrd will take the same time as an ldm (1+N). It could be smaller, depending on whether T1 ldrs are used.

From a quick set of benchmarks I just ran, it looks like on average performance was worse with ldrds than with ldrs. Perhaps because of less scheduling freedom?
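As a back-of-the-envelope check on the 1+N figure, the sketch below compares a sequence of ldrds against pipelined ldrs. The function names are hypothetical, and the model assumes that consecutive ldrds do not overlap, that an odd trailing load costs 2 cycles, and that back-to-back ldrs take 1+N cycles in total. None of this is measured, and it does not capture scheduling effects.

```cpp
// LDRD loads two registers; at 1 + N cycles that is 3 cycles each.
constexpr unsigned LdrdCycles = 1 + 2;

// Assumption: LDRDs do not overlap; an odd trailing load uses a 2-cycle LDR.
constexpr unsigned ldrdSequenceCycles(unsigned NumLoads) {
  return (NumLoads / 2) * LdrdCycles + (NumLoads % 2) * 2;
}

// Assumption: back-to-back LDRs overlap, so N loads take 1 + N cycles.
constexpr unsigned pipelinedLdrCycles(unsigned NumLoads) {
  return NumLoads == 0 ? 0 : 1 + NumLoads;
}
```

Under these assumptions two loads cost 3 cycles either way, while four loads cost 6 cycles as ldrds against 5 as ldrs, so ldrd is never faster in this model.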