This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Teach the ARM Load Store Optimizer to collapse ldr/str's to ldrd/strd's
Needs Review · Public

Authored by rs on Apr 27 2015, 10:00 AM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

This patch adds support to the ARM Load Store Optimizer to generate ldrd/strd instructions for V7-M class cores. I've adapted code from the AArch64 Load Store Optimizer to implement this optimization in the ARM Load Store Optimizer and I've kept the comments the same in some places.

This patch only collapses ldr/str pairs to ldrd/strd for V7-M; a follow-up patch will add support for generating these instruction sequences for V7-A/R class cores.

Diff Detail

Event Timeline

rs updated this revision to Diff 24484 on Apr 27 2015, 10:00 AM.
rs retitled this revision from to [ARM] Teach the ARM Load Store Optimizer to collapse ldr/str's to ldrd/strd's.
rs updated this object.
rs edited the test plan for this revision. (Show Details)

I don't know much about the ARM Load/Store Optimizer, but from looking at it there's already some machinery for generating LDRD/STRD. Why is this a separate LoadStoreToDoubleOpti function instead of being integrated into LoadStoreMultipleOpti?

Always favouring LDRD/STRD is probably a little simplistic. LDRD/STRD is better than LDM/STM in that:

  • it has more flexible addressing, so it can be used in cases where LDM/STM can't
  • it may be faster on the CPU you're compiling for (from a brief perusal of TRMs: on pre-7-A cores and Cortex-M3/M4, LDM may be faster; otherwise LDRD is at least as fast or faster)

Also: when optimizing for size we would want to use LDM if it means fewer bytes' worth of instructions.
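
A minimal sketch of how those two criteria might combine (everything here is illustrative; none of these names come from the patch under review):

    // Illustrative only: fold the size and speed considerations above into
    // one predicate. OptForSize and HasFastLDRD would come from function
    // attributes and the subtarget, respectively.
    bool shouldUseLDRD(bool OptForSize, bool HasFastLDRD, unsigned NumRegs) {
      // A single Thumb2 LDM of N registers is one 4-byte instruction,
      // whereas the same registers need N/2 LDRDs at 4 bytes each.
      if (OptForSize && NumRegs > 2)
        return false;
      // Otherwise let the target say whether LDRD is at least as fast.
      return HasFastLDRD;
    }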

lib/Target/ARM/ARMLoadStoreOptimizer.cpp
67–70

By default always using ldrd/strd without reference to whether it's faster on the target CPU sounds like a bad idea: e.g. according to the Cortex-M3 TRM, LDRD takes 3 cycles, but LDM takes 2 + (number of registers - 1).
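
Working through those numbers (taken from the TRM figures quoted above; treat them as an assumption here) makes the point concrete:

    // Cortex-M3 cycle counts as quoted above (per the TRM):
    //   one LDRD      = 3 cycles
    //   LDM of N regs = 2 + (N - 1) cycles
    unsigned ldmCycles(unsigned NumRegs)   { return 2 + (NumRegs - 1); }
    unsigned ldrdCycles(unsigned NumPairs) { return 3 * NumPairs; }
    // N = 2: LDM = 3 cycles, one LDRD = 3 cycles -- a tie.
    // N = 4: LDM = 5 cycles, two LDRDs = 6 cycles -- LDM wins on M3.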

1821–1831

Actually the even/odd restriction is an A32 restriction, not a non-M-class restriction, i.e. in 7-A/R T32 there should be no problem.
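
For reference, a small predicate capturing that encoding rule (my paraphrase of the architectural constraint, ignoring the additional PC/SP restrictions; not code from the patch):

    // A32 LDRD/STRD require an even-numbered first register and the next
    // consecutive register as the second; T32 (Thumb2) LDRD only requires
    // two distinct registers.
    bool isLegalDoublePair(bool IsT32, unsigned Rt, unsigned Rt2) {
      if (IsT32)
        return Rt != Rt2;
      return (Rt % 2 == 0) && (Rt2 == Rt + 1);
    }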

I agree with John; it would be much better if this code were an extension of the current machinery, not a copy.

I've adapted code from the AArch64 Load Store Optimizer to implement this optimization in the ARM Load Store Optimizer and I've kept the comments the same in some places.

Can you be more specific about what you adapted? Keeping the same comments in the same places is not always the correct thing to do, but more importantly, copy-and-paste is most definitely not the right thing to do.

cheers,
--renato

rengolin added inline comments on Apr 28 2015, 5:33 AM.
lib/Target/ARM/ARMLoadStoreOptimizer.cpp
1821–1831

Certainly the wrong way. A better way would be to have a flag in TableGen (like fast-double-store or whatever). The best way would be to have a cost model, like we have for the vectorizer, but that would be a big change for this small patch.
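
Something along these lines, perhaps (the feature name and accessor are invented for illustration):

    // Hypothetical TableGen feature; in ARM.td it might read:
    //   def FeatureFastLDRD : SubtargetFeature<"fast-ldrd", "HasFastLDRD",
    //       "true", "LDRD/STRD are as fast as LDM/STM">;
    // The pass would then query the subtarget instead of assuming:
    bool shouldFormDoubles(const ARMSubtarget &STI) {
      return STI.hasFastLDRD(); // generated accessor, hypothetical name
    }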

rs added a comment on Apr 29 2015, 8:32 AM.

Hi John and Renato,

Thanks for the review comments.

I don't know much about the ARM Load/Store Optimizer, but from looking at it there's already some machinery for generating LDRD/STRD

The other machinery you're referring to is part of the ARM pre-register-allocation pass: "Pre-register allocation pass that moves loads/stores from consecutive locations closer together to make it more likely they will be combined later."
I haven't looked too much at this pass, but I do see the method 'CanFormLdStDWord', which I could use.

Why is this a separate LoadStoreToDoubleOpti function instead of being integrated into LoadStoreMultipleOpti?

I thought the code would be cleaner if this was done in a function separate from LoadStoreMultipleOpti.
The grouping algorithm I use (the one AArch64 uses) for finding ldr/str instructions to pair together is
different from what LoadStoreMultipleOpti uses. But I guess I can plug my code in inside the
if (TryMerge) { ... } region and, instead of collapsing all the loads/stores into an ldm/stm for V7-M,
iterate through the list two at a time, roughly as sketched below.
If you prefer it to be part of LoadStoreMultipleOpti then I can rework the patch to make it so.
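
That is, something of this shape inside that region (all of these names are placeholders, not the real patch code):

    // Placeholder sketch: for V7-M, walk the merge candidates pairwise
    // instead of emitting a single LDM/STM. MemOps, CanFormLdStDWord and
    // MergeToDouble stand in for the real structures and helpers.
    for (unsigned i = 0; i + 1 < MemOps.size(); i += 2) {
      MachineInstr *First  = MemOps[i];
      MachineInstr *Second = MemOps[i + 1];
      if (CanFormLdStDWord(First, Second)) // cf. the pre-RA pass's helper
        MergeToDouble(First, Second);      // emit one LDRD/STRD
    }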

Also: when optimizing for size we would want to use LDM if it means fewer bytes' worth of instructions.

Ok.

Can you be more specific about what you adapted?

The AArch64 Load/Store optimizer collapses pairs of ldr/str instructions to ldp/stp instructions.
The algorithm it uses to find pairs of ldr/str instructions and the merging step
is what I needed to get the ARM Load/Store Optimizer to collapse ldr/str instructions into ldrd/strd instructions. So what
I took from the AArch64 backend is the following:

  • AArch64LoadStoreOpt::optimizeBlock - renamed to ARMLoadStoreOpt::LoadStoreToDoubleOpti
    • main loop is almost the same, except it looks for Thumb2 ldr/str instructions to collapse
  • AArch64LoadStoreOpt::findMatchingInsn - kept as ARMLoadStoreOpt::findMatchingInsn (see the sketch below)
    • mostly the same, except I've removed the unscaled-offset checking. I also added a check for "Cortex-M3 errata 602117".
  • AArch64LoadStoreOpt::mergePairedInsns - kept as ARMLoadStoreOpt::mergePairedInsns
    • almost the same, but I removed everything to do with sign extensions that was in the AArch64 Load/Store Optimizer.
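
As a self-contained illustration of the scan findMatchingInsn performs (a simplified model, not the LLVM code; the real function also checks aliasing, liveness and register constraints):

    #include <cstdlib>
    #include <vector>

    // MemOp is a stand-in for a MachineInstr load or store.
    struct MemOp { bool IsLoad; int Base; int Offset; };

    // Look forward from Ops[I] for a partner: same kind, same base
    // register, offset exactly one word (4 bytes) away, giving up after
    // Limit instructions. Returns the partner's index, or -1 if none.
    int findMatchingInsn(const std::vector<MemOp> &Ops, size_t I,
                         size_t Limit) {
      for (size_t J = I + 1; J < Ops.size() && J - I <= Limit; ++J) {
        if (Ops[J].IsLoad == Ops[I].IsLoad && Ops[J].Base == Ops[I].Base &&
            std::abs(Ops[J].Offset - Ops[I].Offset) == 4)
          return static_cast<int>(J);
      }
      return -1;
    }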

Ideally I wouldn't have had to copy over code from the AArch64 backend but there isn't a shared code directory for the AArch64 and ARM backends.

Keeping the same comments in the same places is not always the correct thing to do

I've kept the same comments in places where I couldn't have described the behaviour of the code any better myself.

rs added inline comments on Apr 29 2015, 8:34 AM.
lib/Target/ARM/ARMLoadStoreOptimizer.cpp
1821–1831

A better way would be to have a flag in TableGen (like fast-double-store or whatever).

OK, this approach sounds better; I'll do it this way in my next patch.

In D9298#163286, @rs wrote:

If you prefer it to be part of LoadStoreMultipleOpti then I can rework the patch to make it so.

Yes, that would be better. It looks like MergeOps is the function that gets a set of registers and then tries to generate an LDM from them. There you could put some logic to decide whether to instead break it up into a sequence of LDRDs, along the lines sketched below. It also looks like you may need to adjust MergeLDR_STR: it only collects ascending sequences, but Thumb2 LDRD doesn't require that.
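
In sketch form, the decision point inside MergeOps might look like this (ShouldUseLDRD and EmitDoubles are hypothetical; the real function builds the LDM/STM machine instruction):

    // Sketch: once MergeOps has a register set and a base, choose between
    // one LDM/STM and a run of LDRD/STRDs before the existing emission.
    if (ShouldUseLDRD(STI, Regs.size())) {
      EmitDoubles(MBB, InsertPos, Base, Regs); // register pairs -> LDRD/STRD
      return true;
    }
    // ... otherwise fall through to the current LDM/STM path ...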

lib/Target/ARM/ARMLoadStoreOptimizer.cpp
1821–1831

There's some machinery in ARMBaseInstrInfo, e.g. getOperandLatency and getNumMicroOps, that appears to understand the timing of LDM/STM instructions. Maybe it's possible to use that plus LDRD/STRD timing information? That is, instead of putting something in TableGen and then assuming "well, LDRD is fast, so I'm going to guess that here it'll be faster than LDM", calculate what the timing of LDRD and LDM would be for loading a given set of registers and use whichever is quickest, as in the sketch below.
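
In sketch form (the cycle-count wrappers are hypothetical; the real queries would go through ARMBaseInstrInfo and the itinerary data):

    // Hypothetical cost comparison instead of a boolean "fast" flag:
    // model both timings for this register set and pick the cheaper form.
    bool preferLDRD(unsigned NumRegs) {
      unsigned LdmCycles  = cyclesForLDM(NumRegs);  // hypothetical wrapper
      unsigned LdrdCycles = cyclesForLDRD() * ((NumRegs + 1) / 2);
      return LdrdCycles < LdmCycles;
    }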