This is an archive of the discontinued LLVM Phabricator instance.

ARM: align loops to 4 bytes on Cortex-M3 and Cortex-M4.
ClosedPublic

Authored by t.p.northover on Sep 7 2018, 5:23 AM.

Details

Summary

The Technical Reference Manuals for these two CPUs state that branching to an unaligned 32-bit instruction incurs an extra pipeline reload penalty. That's bad.

I also enable the optimization at -Os for just these two CPUs. My impression has been that it's a bit of a gamble with the bigger cores, and it also wastes more space. But for these two we're getting 1 cycle per iteration in return for 1 byte per loop (on average); that seems like it definitely fits into LLVM's quirky definition of -Os.

I'm open to extending it to other processors, but my research indicates Cortex-M0 is too simple to benefit (it claims conditional branches are always 3 cycles if taken), and Cortex-M7 has no performance documentation.

Diff Detail

Repository
rL LLVM

Event Timeline

t.p.northover created this revision.Sep 7 2018, 5:23 AM
dmgreen added a subscriber: dmgreen.Sep 9 2018, 1:07 PM

Good stuff. The downstream version of this we have uses Subtarget->isMClass() && Subtarget->hasThumb2(). (Dont ask me why it's downstream, I'm not sure. It's from before my time). I think you are right about the M7 though, the results there are just different, not necessarily better.

We should also be aligning functions?

And we should include the M33 and its derivatives (but like you said, not M0+/M23)

llvm/lib/CodeGen/MachineBlockPlacement.cpp
2500–2501

This seems like an odd change to make at Os. It, by definition, increases code size.

Like you said, this might be an llvm quirk though. Do you have a link to llvm's definition of -Os?

llvm/lib/Target/ARM/ARM.td
950

What do you think of using FeatureAlignBranchTargets or something like it?

llvm/lib/Target/ARM/ARMISelLowering.cpp
1204

Here with this gubbins may be a better place for setting the loop alignment.

We should also be aligning functions?

Interesting question. There are a couple of reasons to think the tradeoffs are different there. First, I'd probably (based purely on intuition rather than data) expect a loop to be executed more times than most functions. Second, when r7 is the FP, a function is almost guaranteed to start with a 16-bit instruction (push {..., r7, lr}).

I do have a captive embedded developer who cares very much about individual cycles at the moment though, so I'll ask him to benchmark it.

Anyway, thanks for commenting. I'll get started on the obvious changes.

llvm/lib/CodeGen/MachineBlockPlacement.cpp
2500–2501
llvm/lib/Target/ARM/ARM.td
950

An excellent idea.

llvm/lib/Target/ARM/ARMISelLowering.cpp
1204

Yep.

Interesting question. There are a couple of reasons to think the tradeoffs are different there. First, I'd probably (based purely on intuition rather than data) expect a loop to be executed more times than most functions. Second, when r7 is the FP, a function is almost guaranteed to start with a 16-bit instruction (push {..., r7, lr}).

The counter argument would be that in the case of function alignment, you don't need to execute the added NOP. For loops, there's always the small chance that you could be running an outer loop more than the inner one, leading to a executed nops that each take a cycle (unless it's the M33 and you get lucky with it's dual issue). If you don't care about the codesize, I think function alignment it will always be a win because the extra padding is just in unused space.

But let us know how the benchmarks look.

I think I've implemented the suggested changes, except for the function alignment one.

The benchmarks came back as about a 0.2% difference in cycle count, and (crucially) there's no way when deciding function alignment to check for OptSize so we'd inevitably pessimize some cases.

dmgreen accepted this revision.Sep 12 2018, 10:39 AM

The benchmarks came back as about a 0.2% difference in cycle count, and (crucially) there's no way when deciding function alignment to check for OptSize so we'd inevitably pessimize some cases.

I think this what getPrefAlignment is for? As opposed to MinFunctionAlignment? I agree that that's a different issue though, and doesn't need to be done with this.

Although I personally think the definition of Os is a bit odd, and this is increasing codesize, everyone else I've talked to agreed with you that this is fine. Like you said, in (almost) all cases we get a performance win for the bytes we spend.

So LGTM!

This revision is now accepted and ready to land.Sep 12 2018, 10:39 AM
t.p.northover accepted this revision.Sep 13 2018, 3:56 AM

Thanks. Committed as r342127.

dmgreen closed this revision.Oct 18 2018, 1:41 AM