This is an archive of the discontinued LLVM Phabricator instance.

[Driver] Default -msmall-data-limit= to 0
Needs ReviewPublic

Authored by MaskRay on Jun 6 2023, 8:35 AM.

Details

Summary

D57497 added -msmall-data-limit= as an alias for -G and defaulted it to 8 for
-fno-pic/-fpie.

GCC documents this option as "Put global and static data smaller than <number> bytes into a special section (on some targets)."
The targets are not documented, but it seems to not use the value for -fpie.

I think the different behavior for -fno-pic/-fpie/-fpic is not a good
idea and defaulting to -msmall-data-limit=0 makes more sense.

Diff Detail

Event Timeline

MaskRay created this revision.Jun 6 2023, 8:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 6 2023, 8:35 AM
MaskRay requested review of this revision.Jun 6 2023, 8:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 6 2023, 8:35 AM
MaskRay added inline comments.Jun 6 2023, 8:36 AM
clang/lib/Driver/ToolChains/Clang.cpp
2087

The code is written in a confusing way with several problems including the unclear diagnostic. I just left a comment on https://reviews.llvm.org/D57497#4399715 and don't want to address them in this patch.

pirama added a subscriber: pirama.Jun 6 2023, 8:53 AM
phosek added a subscriber: phosek.Jun 7 2023, 11:15 AM

We're planning to default to -msmall-data-limit=0 for Android and Fuchsia so I'm supportive of this change because it means less complexity and fewer differences between platforms.

I think it would be worth bringing up this topic at the RISC-V LLVM sync-up call tomorrow.

asb added a comment.Jun 7 2023, 11:31 AM

We're planning to default to -msmall-data-limit=0 for Android and Fuchsia so I'm supportive of this change because it means less complexity and fewer differences between platforms.

I think it would be worth bringing up this topic at the RISC-V LLVM sync-up call tomorrow.

Sounds good, I'll put it on the agenda.

asb added a comment.Jun 8 2023, 8:22 AM

One of the key things we've been discussing on this at the LLVM call is that we probably want to keep the small data limit for embedded targets.

MaskRay added a comment.EditedJun 8 2023, 2:40 PM

One of the key things we've been discussing on this at the LLVM call is that we probably want to keep the small data limit for embedded targets.

It'd be useful to hear from some concrete embedded target users, whether they customize -msmall-data-limit=, whether they are happy with -msmall-data-limit=8, or whether they are just unaware of this threshold.

If an embedded system typically customizes compiler driver options a lot, I think they can consider adding -msmall-data-limit= (with a value suitable for their use cases) in a configuration file, not letting their use cases dictate Linux systems.

I putting up the patch partly came from my stance finding UX of this option is really confusing and misleading. I wish that the embedded target users can give me compelling arguments to keep -msmall-data-limit=8 (and not another value).

One of the key things we've been discussing on this at the LLVM call is that we probably want to keep the small data limit for embedded targets.

It'd be useful to hear from some concrete embedded target users, whether they customize -msmall-data-limit=, whether they are happy with -msmall-data-limit=8, or whether they are just unaware of this threshold.

If an embedded system typically customizes compiler driver options a lot, I think they can consider adding -msmall-data-limit= (with a value suitable for their use cases) in a configuration file, not letting their use cases dictate Linux systems.

I putting up the patch partly came from my stance finding UX of this option is really confusing and misleading. I wish that the embedded target users can give me compelling arguments to keep -msmall-data-limit=8 (and not another value).

The default of 8 is probably to make it consistent with gcc. Here is text from man gcc

-msmall-data
-mlarge-data
    When -mexplicit-relocs is in effect, static data is accessed
    via gp-relative relocations.  When -msmall-data is used,
    objects 8 bytes long or smaller are placed in a small data
    area (the ".sdata" and ".sbss" sections) and are accessed via
    16-bit relocations off of the $gp register.  This limits the
    size of the small data area to 64KB, but allows the variables
    to be directly accessed via a single instruction.

    The default is -mlarge-data.  With this option the data area
    is limited to just below 2GB.  Programs that require more
    than 2GB of data must use "malloc" or "mmap" to allocate the
    data in the heap instead of in the program's data segment.

    When generating code for shared libraries, -fpic implies
    -msmall-data and -fPIC implies -mlarge-data.
MaskRay added a comment.EditedJun 12 2023, 6:41 PM

One of the key things we've been discussing on this at the LLVM call is that we probably want to keep the small data limit for embedded targets.

It'd be useful to hear from some concrete embedded target users, whether they customize -msmall-data-limit=, whether they are happy with -msmall-data-limit=8, or whether they are just unaware of this threshold.

If an embedded system typically customizes compiler driver options a lot, I think they can consider adding -msmall-data-limit= (with a value suitable for their use cases) in a configuration file, not letting their use cases dictate Linux systems.

I putting up the patch partly came from my stance finding UX of this option is really confusing and misleading. I wish that the embedded target users can give me compelling arguments to keep -msmall-data-limit=8 (and not another value).

The default of 8 is probably to make it consistent with gcc. Here is text from man gcc

-msmall-data
-mlarge-data
    When -mexplicit-relocs is in effect, static data is accessed
    via gp-relative relocations.  When -msmall-data is used,
    objects 8 bytes long or smaller are placed in a small data
    area (the ".sdata" and ".sbss" sections) and are accessed via
    16-bit relocations off of the $gp register.  This limits the
    size of the small data area to 64KB, but allows the variables
    to be directly accessed via a single instruction.

    The default is -mlarge-data.  With this option the data area
    is limited to just below 2GB.  Programs that require more
    than 2GB of data must use "malloc" or "mmap" to allocate the
    data in the heap instead of in the program's data segment.

    When generating code for shared libraries, -fpic implies
    -msmall-data and -fPIC implies -mlarge-data.

This GCC documentation is about the rx port. GCC's alpha and rx ports support -msmall-data, but RISC-V just says:

-msmall-data-limit=n Put global and static data smaller than n bytes into a special section (on some targets).

Matching GCC behaviors give many benefits and Clang does this a lot, especially in cases where matching/not-matching doesn't make much difference (i.e. we don't need to have an opinion).
However, RISC-V -msmall-data-limit= is probably a case warranting a difference.
The global pointer relaxation has a very limited value (benchmarked by multiple parties, including a party which implemented this feature in the GNU toolchain: even they can only say the optimization only applies to very specific projects).
The default value is confusing: as explained by the summary.

I suspect that -msmall-data-limit=8 is too conservative, maybe 16 would be better, I don't know. I think global pointer relaxation users should toggle this by themselves, not relying on 0 or 8 default decided by a bunch of strange conditions.

(Actually, I suspect that there is some willingness in changing GCC's default as well. Unfortunately, there is some political reasons blocking that.)

asb added a comment.Jun 14 2023, 1:31 AM

However, RISC-V -msmall-data-limit= is probably a case warranting a difference.
The global pointer relaxation has a very limited value (benchmarked by multiple parties, including a party which implemented this feature in the GNU toolchain: even they can only say the optimization only applies to very specific projects).
The default value is confusing: as explained by the summary.

I suspect that -msmall-data-limit=8 is too conservative, maybe 16 would be better, I don't know. I think global pointer relaxation users should toggle this by themselves, not relying on 0 or 8 default decided by a bunch of strange conditions.

I don't disagree that the small data limit being 8 rather than something else doesn't seem to be particularly well motivated, but I understand that the case where the option does make a difference is on embedded targets (I think the data that was shared before was for SPEC, but could be wrong?). I think we can generally expect more willingness for people targeting embedded systems to explore different compiler flags, but just matching gcc does feel like a better default. What do you think about keeping the default for bare metal targets?

However, RISC-V -msmall-data-limit= is probably a case warranting a difference.
The global pointer relaxation has a very limited value (benchmarked by multiple parties, including a party which implemented this feature in the GNU toolchain: even they can only say the optimization only applies to very specific projects).
The default value is confusing: as explained by the summary.

I suspect that -msmall-data-limit=8 is too conservative, maybe 16 would be better, I don't know. I think global pointer relaxation users should toggle this by themselves, not relying on 0 or 8 default decided by a bunch of strange conditions.

I don't disagree that the small data limit being 8 rather than something else doesn't seem to be particularly well motivated, but I understand that the case where the option does make a difference is on embedded targets (I think the data that was shared before was for SPEC, but could be wrong?). I think we can generally expect more willingness for people targeting embedded systems to explore different compiler flags,

Agree.

but just matching gcc does feel like a better default. What do you think about keeping the default for bare metal targets?

I am unsure we want to add the complexity and refrain from changing the default for bare metal targets due to haunted graveyards https://www.usenix.org/sites/default/files/conference/protected-files/srecon17americas_slides_reese.pdf

We are already different from GCC in a number of ways:

  • for -fpie, we may emit .sdata/.sbss but GCC won't
  • we accept -G while GCC doesn't.
  • we don't emit .srodata.

I made a comment on https://lists.riscv.org/g/sig-toolchains/message/619 and informed GCC folks in case they can change the default as well.
If they don't, I think we made a good choice for not following this particular behavior.

(I feel sad that I did not see D57497 and it landed with a blocked review. If we started from scratch, we would probably run into a cleaner state.)

However, RISC-V -msmall-data-limit= is probably a case warranting a difference.
The global pointer relaxation has a very limited value (benchmarked by multiple parties, including a party which implemented this feature in the GNU toolchain: even they can only say the optimization only applies to very specific projects).
The default value is confusing: as explained by the summary.

I suspect that -msmall-data-limit=8 is too conservative, maybe 16 would be better, I don't know. I think global pointer relaxation users should toggle this by themselves, not relying on 0 or 8 default decided by a bunch of strange conditions.

I don't disagree that the small data limit being 8 rather than something else doesn't seem to be particularly well motivated, but I understand that the case where the option does make a difference is on embedded targets (I think the data that was shared before was for SPEC, but could be wrong?). I think we can generally expect more willingness for people targeting embedded systems to explore different compiler flags,

Agree.

but just matching gcc does feel like a better default. What do you think about keeping the default for bare metal targets?

I am unsure we want to add the complexity and refrain from changing the default for bare metal targets due to haunted graveyards https://www.usenix.org/sites/default/files/conference/protected-files/srecon17americas_slides_reese.pdf

We are already different from GCC in a number of ways:

  • for -fpie, we may emit .sdata/.sbss but GCC won't
  • we accept -G while GCC doesn't.
  • we don't emit .srodata.

I made a comment on https://lists.riscv.org/g/sig-toolchains/message/619 and informed GCC folks in case they can change the default as well.
If they don't, I think we made a good choice for not following this particular behavior.

(I feel sad that I did not see D57497 and it landed with a blocked review. If we started from scratch, we would probably run into a cleaner state.)

Hi MaskRay,

-fpie is an oversight. Thanks for pointing out. For D57497, I thought it behave as good community citizen that trying to address all the comments.
Is there unaddressing comment? If there is, that might not be intended.

The patch tries to mimic the GCC to enable the relaxation. It generally put the data smaller than the threshold to the small data section which near gp and has higher possibility to transfer to gp based load/store.

Relaxation is less useful for large scale software. In previous time, most developers focus on embedded worlds. Now the flags become less friendly for the OS world.
I agree to change the default to make the OS world life easier.

I also agree with Alex. Should we preserve the threshold for embedded usage? Should we try to sync with GCC as possible?

Thank you for chiming in. I disagree that we keep default=8 for embedded without understanding (a) why 8 provides values and (b) justifying that the value is significant enough for embedded to be different.

I think Alex's argument "I think we can generally expect more willingness for people targeting embedded systems to explore different compiler flags" is actually a +1 to decrease the complexity here.

hiraditya resigned from this revision.Aug 17 2023, 5:44 PM

I am still interested in moving this forward. What should be done here? If the decision is to keep the current odd default 8 for toolchains::RISCVToolChain, I guess I'll have to take the compromise as making a step forward is better than nothing.

I am still interested in moving this forward. What should be done here? If the decision is to keep the current odd default 8 for toolchains::RISCVToolChain, I guess I'll have to take the compromise as making a step forward is better than nothing.

On 1 RV64 CPU I tried in our RTL simulator, changing from 8 to 0 reduced dhrystone score by 2.7%. Using 16, or 32 gave the same score as 8. Reducing 8 to 4 improved the score by 0.5%.

I am still interested in moving this forward. What should be done here? If the decision is to keep the current odd default 8 for toolchains::RISCVToolChain, I guess I'll have to take the compromise as making a step forward is better than nothing.

On 1 RV64 CPU I tried in our RTL simulator, changing from 8 to 0 reduced dhrystone score by 2.7%. Using 16, or 32 gave the same score as 8. Reducing 8 to 4 improved the score by 0.5%.

Thank you for sharing the benchmarks!

My view is that global pointer relaxation is an expert option that the user needs to tune (like that ld.lld doesn't default to --relax-gp).
People can create more articles about global pointer relaxation usage, and make the default out of the business of the driver.
If anyone tells me sdata/srodata/sbss adoption for other languages' compiler drivers, I'll definitely tell them not to copy the 0/8 complex rules in clang driver:)

asb added a comment.Sep 5 2023, 5:21 AM

I am still interested in moving this forward. What should be done here? If the decision is to keep the current odd default 8 for toolchains::RISCVToolChain, I guess I'll have to take the compromise as making a step forward is better than nothing.

On 1 RV64 CPU I tried in our RTL simulator, changing from 8 to 0 reduced dhrystone score by 2.7%. Using 16, or 32 gave the same score as 8. Reducing 8 to 4 improved the score by 0.5%.

Thanks for testing - do you have a view one way or another on this default? I'm somewhat torn personally.

I am still interested in moving this forward. What should be done here? If the decision is to keep the current odd default 8 for toolchains::RISCVToolChain, I guess I'll have to take the compromise as making a step forward is better than nothing.

On 1 RV64 CPU I tried in our RTL simulator, changing from 8 to 0 reduced dhrystone score by 2.7%. Using 16, or 32 gave the same score as 8. Reducing 8 to 4 improved the score by 0.5%.

Thanks for testing - do you have a view one way or another on this default? I'm somewhat torn personally.

Is this patch blocked by anything?

MaskRay updated this revision to Diff 557659.Oct 9 2023, 11:11 PM

test rebase