

[Clang][AArch64] Fine-grained ldp and stp policies.
Needs Review · Public

Authored by manosanag on Sep 7 2023, 4:35 AM.



This patch enables fine-grained tuning control for ldp and stp.

It provides two new command-line options, -aarch64-ldp-policy
and -aarch64-stp-policy, which control the load and store pairing
policies separately. Both work with the clang and flang-new frontends,
including when using -flto.

The accepted values for both options are:

  • default: Use the ldp/stp policy the compiler currently uses (equivalent to always).
  • always: Emit ldp/stp regardless of alignment.
  • never: Do not emit ldp/stp.
  • aligned: Emit ldp/stp only if the load/store is aligned to 2 * element_size.
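
The four values above can be read as a simple predicate. The following is a minimal sketch, not the patch's actual code: the names PairPolicy and mayFormPair are hypothetical, and the alignment is assumed to be a known power-of-two byte alignment of the first access.

```cpp
#include <cstdint>

// Hypothetical names for illustration; the patch's identifiers may differ.
enum class PairPolicy { Default, Always, Never, Aligned };

// Returns true if two adjacent accesses of `elementSize` bytes may be
// combined into one ldp/stp, given the known alignment (in bytes, assumed
// to be a power of two) of the first access.
inline bool mayFormPair(PairPolicy policy, std::uint64_t alignment,
                        std::uint64_t elementSize) {
  switch (policy) {
  case PairPolicy::Default: // described in the summary as behaving like "always"
  case PairPolicy::Always:
    return true;
  case PairPolicy::Never:
    return false;
  case PairPolicy::Aligned:
    // Pair only when the access is aligned to 2 * element_size.
    return alignment >= 2 * elementSize;
  }
  return false; // unreachable for valid enum values
}
```

Under these assumptions, two 8-byte loads with 16-byte alignment would be paired under aligned, while the same loads with only 8-byte alignment would not.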


Event Timeline

manosanag created this revision. Sep 7 2023, 4:35 AM
manosanag requested review of this revision. Sep 7 2023, 4:35 AM
manosanag updated this revision to Diff 556154. Sep 7 2023, 8:10 AM

Updated to provide visibility for the options, because without it
my Fortran tests regressed after rebasing onto the
current LLVM main branch.

Can you give more details about why this is wanted and which cases it helps with? Is it an optimization, as opposed to a workaround for some correctness issue?

manosanag added a comment. Edited Sep 8 2023, 12:54 AM

Hello Dave,

thanks for replying.

Yes, this is an optimization.

On some AArch64 cores, including Ampere's ampere1 architecture that this is targeted for, load/store pair instructions are faster than simple loads/stores only when the alignment of the pair is at least twice that of the individual element being loaded. Based on the performance of various benchmarks, emitting ldp/stp instructions was disabled in GCC at some point (discussion is […]). This patch improves on that and offers control over when the instructions are used.

A similar patch with the same flags was recently submitted for review on the GCC mailing lists ([…]).

I have a fix ready for the Fortran regressions shown by autotesting. I could include some of this information in the commit message of the diff.

Should we move this to a GitHub PR instead?

We do not usually add front-end clang options for optimizations like this. Users are more likely to use them incorrectly, or just not know that they exist. The usual method would be to make a subtarget tuning feature that controls whether ldp instructions are created, and enable it for -mcpu=ampere1.

Having an internal LLVM option for it (-mllvm -aarch64-stp-policy=never) sounds fine, but it should be considered an internal option. Adding a subtarget feature would also make sense, so that this is used for ampere1. If you get the option committed to GCC then it might be OK for clang too, but I would suggest splitting this into one patch for the backend part and another for the frontend option in either case.
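
To illustrate the layering suggested here, the sketch below shows one way a per-CPU tuning default and an internal override could combine. It is self-contained and does not use LLVM's APIs. The names tuningDefaultFor and effectivePolicy are hypothetical, and the choice of aligned as the ampere1 default is an assumption made for illustration, not something decided in this review.

```cpp
#include <optional>
#include <string>

enum class PairPolicy { Always, Never, Aligned };

// Hypothetical stand-in for a subtarget tuning feature keyed on -mcpu.
inline PairPolicy tuningDefaultFor(const std::string &cpu) {
  // Assumption for illustration only: ampere1 prefers aligned-only pairs.
  if (cpu == "ampere1")
    return PairPolicy::Aligned;
  return PairPolicy::Always; // matches the current "always" behaviour
}

// An explicit internal option (for example, one passed via -mllvm) takes
// precedence; otherwise the CPU's tuning default applies.
inline PairPolicy effectivePolicy(const std::string &cpu,
                                  std::optional<PairPolicy> internalOverride) {
  return internalOverride.value_or(tuningDefaultFor(cpu));
}
```

With this arrangement, users who pass only -mcpu=ampere1 would get the tuned behaviour without knowing any option exists, while the internal option remains available for experiments.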