
Optimize access to global variable references in PIE mode when linker supports copy relocations for PIE
Needs Review, Public

Authored by tmsriram on May 5 2016, 2:19 PM.

Details

Summary

For X86, in PIE mode, when accessing extern global variables, an extra load from the GOT can be avoided if the linker allows copy relocations for PIE. This patch makes it possible. This is already available in the GCC compiler.

This patch looks for "PIE Copy Relocs" in the module flags to optimize the access to the global variable. The Clang patch that adds a new option "-mpiecopyrelocs" to indicate copy relocation support, and adds the module flag when the option is passed, is here: http://reviews.llvm.org/D19996

GCC patch discussing this : https://gcc.gnu.org/ml/gcc-patches/2014-05/msg01215.html

Short summary:

  • When linking executables in non-PIE mode, references to external globals do not use the GOT. They are treated as if the globals were defined in the executable. If it so happens that the global is not defined in the executable, a copy relocation is used to duplicate the definition in the executable.
  • Currently, in PIE mode, all accesses to external globals use an extra load from the GOT to get the global's address, which is then dereferenced. This is particularly inefficient if it turns out at link time that the global ends up being defined in the executable from a different compilation unit. It turns out that copy relocations can be used here, just like in non-PIE mode, to always avoid the extra load from the GOT and directly access the global.
  • Gold and GNU ld started supporting copy relocations for PIE only recently. Gold supports them via this patch: https://sourceware.org/ml/binutils/2014-05/msg00092.html
  • Since this technique cannot be used to avoid the extra load if the linker does not support it, I am creating a clang patch to add an extra option '-mpiecopyrelocs' which can be used to specify support.
  • Support for linker copy relocations will be specified in bit code as a module flag "PIE Copy Relocations".
  • This technique can be used for 32 and 64 bit x86. However, gold still does not support copy relocations for 32-bit PIE. I will do that separately.
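
To make the cost difference in the summary above concrete, here is a minimal C sketch that models the two access patterns. The names `got_slot`, `read_via_got` and `read_direct` are hypothetical, and the pointer variable only imitates a GOT entry; in real builds the compiler and linker generate this indirection themselves.

```c
#include <assert.h>

/* Stands in for a global that some other module might define. */
int shared_counter = 7;

/* Hypothetical model of a GOT entry: a slot holding the variable's address.
   volatile keeps the compiler from folding the indirection away. */
static int *volatile got_slot = &shared_counter;

/* GOT-style access: load the address from the slot, then load the value. */
int read_via_got(void) {
    int *addr = got_slot;   /* first load: the address stored in the slot */
    assert(addr != 0);
    return *addr;           /* second load: the value itself */
}

/* Direct access, as in non-PIE code or PIE with copy relocations: one load. */
int read_direct(void) {
    return shared_counter;
}
```

Both functions return the same value; the difference is only the extra load in `read_via_got`, which is what the patch aims to avoid.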

Event Timeline

tmsriram updated this revision to Diff 56344. (May 5 2016, 2:19 PM)
tmsriram retitled this revision to Optimize access to global variable references in PIE mode when linker supports copy relocations for PIE.
tmsriram updated this object.
tmsriram added a reviewer: rnk.
tmsriram added subscribers: llvm-commits, davidxl, rafael.
tmsriram updated this object. (May 5 2016, 2:27 PM)

taking a look.

Is this still a problem with linkers which can convert load via GOT slot into load
immediate?

Sorry for the long reply, but I think it is critical we get this right.

When a variable/function is defined in a shared library, either the library or the main program has to use a GOT. The other one can then use direct accesses.

In every non-ELF system, accesses from inside the library are direct (lea foo(%rip)) and accesses from the main binary are indirect (mov foo@GOTPCREL(%rip)). The rationale is that access from within the library is more common and the library normally wants the guarantee that its own functions/variables are not preempted. I think it is critical to support something like this in ELF.

On ELF, quite unfortunately IMHO, the default is to allow preemption. If something can be preempted, you need to use a GOT. And if the library has to use a GOT anyway, the main executable may as well create a copy relocation so that it, at least, can use a lea.

So the first question is: In your testcase/benchmark, is preemption required, or is the variable/function more frequently accessed from the main binary than from the library that defines it? If not, you would probably get even better performance by disallowing preemption, having the library access its own symbols directly, and having the main binary use a GOT.

If you do have a case where the main binary is the one with the most frequent accesses (or preemption is required), I think that is fine. With ELF we should be able to support both cases, we just have to be really careful.

Right now the LLVM support for preemption is also really bad. We will pretend the symbol cannot be preempted and then at the last minute use a GOT.

So let's start from the IR up and see how we can represent all cases. It would be particularly desirable that:

  • The access type can be decided from looking just at the GV, not the module or codegen flags. The codegen flag would be just for choosing between "mov $foo" and "lea foo(%rip)", not for using a GOT or not.
  • The IR linker just works.
  • The same IR has the same semantics in COFF, MachO and ELF.
  • Simpler IR is used for the common cases.
  • Keep the property that one only has to look at the linkage to decide if a function can be inlined or not (thanks to John McCall for suggesting this one).
  • A visibility attribute in C ends up with that visibility in the .o.

Let's first consider how to design an IR that provides that and then see how to map C and command line options to it. We need the IR to cover 4 cases:

  • Direct access to definition.
  • GOT access to definition.
  • Direct access to declaration.
  • GOT access to declaration.

I propose that the IR we use is:

For

@a = global i32 42
define i32* @f() { ret i32* @a }

we know @a cannot be preempted, so we produce

leaq a(%rip), %rax

For

@a = external global i32
define i32* @f() { ret i32* @a }

we don't know where @a is, so we produce

movq a@GOTPCREL(%rip), %rax

For

@a = preemptable global i32 42
define i32* @f() { ret i32* @a }

we have a new linkage type to mark a definition as preemptable. GVs with preemptable linkage
must have default visibility. We produce:

movq a@GOTPCREL(%rip), %rax

For

@a = external_local global i32
define i32* @f() { ret i32* @a }

We have a new linkage for assuming a declaration is in the current dso. We produce

leaq a(%rip), %rax

For

@a = protected external_local global i32
define i32* @f() {  ret i32* @a }

We produce

leaq a(%rip), %rax
.protected a

Declarations with non-default visibility are required to be external_local.
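
The five IR cases above reduce to a small decision table. The following C sketch restates it; the enum and function names are hypothetical and are not part of LLVM, and the mapping is taken only from the examples in this comment.

```c
#include <assert.h>

/* Hypothetical names mirroring the linkages in the proposal above. */
enum gv_kind {
    GV_DEFINITION,      /* @a = global i32 42 */
    GV_EXTERNAL,        /* @a = external global i32 */
    GV_PREEMPTABLE,     /* @a = preemptable global i32 42 */
    GV_EXTERNAL_LOCAL   /* @a = external_local global i32 */
};

enum access_kind {
    ACCESS_DIRECT,      /* leaq a(%rip), %rax */
    ACCESS_GOT          /* movq a@GOTPCREL(%rip), %rax */
};

/* The access type depends only on the global itself, not on module or
   codegen flags, which is the first property the proposal asks for. */
enum access_kind access_for(enum gv_kind kind) {
    switch (kind) {
    case GV_DEFINITION:
    case GV_EXTERNAL_LOCAL:
        return ACCESS_DIRECT;
    case GV_EXTERNAL:
    case GV_PREEMPTABLE:
        return ACCESS_GOT;
    }
    assert(0 && "unknown gv_kind");
    return ACCESS_GOT;
}
```

A protected external_local declaration takes the same direct path as a plain external_local one; only the emitted `.protected` directive differs.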

Now, let's see how the various file formats would map from C to the above:

ELF:

  • If there is a non-default visibility attribute:
    • Declarations become extern_local if the visibility is not default
    • Use the visibility
  • -fPIC. Symbols can be preempted, so use the preemptable linkage.
  • No option or -fPIE: Symbols cannot be preempted:
    • A dllimport declaration is external, others are external_local.

MachO:

  • -fPIC: Just like today.
  • static (is that supported?): declarations are external_local.

COFF:

  • A dllimport declaration is external, others are external_local.
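
As a rough illustration of the ELF and COFF rules above for declarations, here is a hedged C sketch. All names are hypothetical. The MachO rules are left out because "just like today" is not spelled out here, and the -fPIC branch for ELF declarations is an assumption: the list only says that symbols can be preempted under -fPIC, so the sketch falls back to a plain external (GOT) declaration in that case.

```c
#include <assert.h>
#include <stdbool.h>

enum object_format { FMT_ELF, FMT_COFF };
enum decl_linkage { DECL_EXTERNAL, DECL_EXTERNAL_LOCAL };

struct decl_info {
    bool default_visibility;  /* false if a visibility attribute is present */
    bool pic;                 /* true for -fPIC, false for -fPIE or no option */
    bool dllimport;           /* declaration is marked dllimport */
};

/* Pick the linkage for a declaration, following the lists above. */
enum decl_linkage decl_linkage_for(enum object_format fmt, struct decl_info d) {
    switch (fmt) {
    case FMT_COFF:
        return d.dllimport ? DECL_EXTERNAL : DECL_EXTERNAL_LOCAL;
    case FMT_ELF:
        if (!d.default_visibility)
            return DECL_EXTERNAL_LOCAL;  /* non-default visibility */
        if (d.pic)
            return DECL_EXTERNAL;        /* ASSUMPTION: not stated in the list */
        return d.dllimport ? DECL_EXTERNAL : DECL_EXTERNAL_LOCAL;
    }
    assert(0 && "unhandled object format");
    return DECL_EXTERNAL;
}
```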

Sorry, I know that this is a lot more work, but I am willing to help since we have to fix this representation deficiency once and for all. I think the steps would be

  • Introduce the preemptable and extern_local linkages and verifier checks for them.
  • Codegen them as defined above, regardless of -relocation-model.
  • Change clang to use the new translation.
  • Upgrade declarations with non default visibility to extern_local.
  • Verify that all non default visibility declarations are extern_local.
  • Consolidate all of the pic/pie into a single option that says if the code is position independent or not. The only difference is using "mov $foo" or "lea foo(%rip)"

rjmccall edited edge metadata. (May 6 2016, 9:42 AM)

Can you explain how "preemptable" is different from "weak"?

I like the idea of having a linkage that says that the object is known to be defined within the current linkage unit; that's certainly cleaner than the current workaround, which is just to give a declaration hidden visibility.

Sorry for the long reply, but I think it is critical we get this right.

When a variable/function is defined in a shared library, either the library or the main program has to use a GOT. The other one can then use direct accesses.

This seems only partially correct, as it does not address the following case. Let's say the shared library and the executable both define the global variable but initialize it differently. Then, if the shared library does not use the GOT, it is not possible to suddenly create a copy relocation in the shared library (which is already built and is also used by other executables), and you are now accessing a global variable with the incorrect initialization. The global definition in the executable must take precedence, and that is not happening here.
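
The precedence rule in this objection can be modeled in a few lines of C. This is only a sketch under the assumption that symbol lookup walks the loaded modules in order with the executable first; the `module` struct and `resolve` function are hypothetical and do not correspond to any real loader API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One loaded module and the single symbol it defines (NULL if none). */
struct module {
    const char *name;       /* e.g. "exe" or "libfoo.so" */
    const char *symbol;     /* name of the global it defines */
    int initial_value;      /* initializer used by that definition */
};

/* Walk the modules in load order; the first definition found wins.
   With the executable listed first, its definition takes precedence. */
const struct module *resolve(const struct module *mods, size_t n,
                             const char *symbol) {
    assert(mods != NULL && symbol != NULL);
    for (size_t i = 0; i < n; i++) {
        if (mods[i].symbol != NULL && strcmp(mods[i].symbol, symbol) == 0)
            return &mods[i];
    }
    return NULL;  /* undefined symbol */
}
```

If the library accessed its own copy directly instead of going through this lookup, it would see its own initializer rather than the executable's, which is the mismatch described above.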

In every non-ELF system, accesses from inside the library are direct (lea foo(%rip)) and accesses from the main binary are indirect (mov foo@GOTPCREL(%rip)). The rationale is that access from within the library is more common and the library normally wants the guarantee that its own functions/variables are not preempted. I think it is critical to support something like this in ELF.

On ELF, quite unfortunately IMHO, the default is to allow preemption. If something can be preempted, you need to use a GOT. And if the library has to use a GOT anyway, the main executable may as well create a copy relocation so that it, at least, can use a lea.

Right, this is the rationale for this patch. It is also based on the simple observation that the executable's definition takes precedence. Another way of looking at this: this is exactly what happens in non-PIE mode. Why not preserve the same behavior in PIE mode?

So the first question is: In your testcase/benchmark, is preemption required, or is the variable/function more frequently accessed from the main binary than from the library that defines it? If not, you would probably get even better performance by disallowing preemption, having the library access its own symbols directly, and having the main binary use a GOT.

For us at least, most of the globals end up in the executable, and we do not want to conservatively use the GOT just because a global was declared external in the modules accessing it. Non-PIE mode has the identical problem and uses copy relocations. Extend that to PIE mode.

I understand that the extra option '-mpiecopyrelocs' makes this complicated. This is only to support linkers that do not support copy relocations for PIE yet. At some point, when linkers support this, this option can be deleted and this behavior made the default.

If you do have a case where the main binary is the one with the most frequent accesses (or preemption is required), I think that is fine. With ELF we should be able to support both cases, we just have to be really careful.

Yes, this is our case and that is why we are really interested in solving this.

Right now the LLVM support for preemption is also really bad. We will pretend the symbol cannot be preempted and then at the last minute use a GOT.

For global variables at least, with ELF and X86, LLVM support seems identical to GCC support regarding symbol preemption.

So let's start from the IR up and see how we can represent all cases. It would be particularly desirable that:

  • The access type can be decided from looking just at the GV, not the module or codegen flags. The codegen flag would be just for choosing between "mov $foo" and "lea foo(%rip)", not for using a GOT or not.
  • The IR linker just works.
  • The same IR has the same semantics in COFF, MachO and ELF.
  • Simpler IR is used for the common cases.
  • Keep the property that one only has to look at the linkage to decide if a function can be inlined or not (thanks to John McCall for suggesting this one).
  • A visibility attribute in C ends up with that visibility in the .o.

Let's first consider how to design an IR that provides that and then see how to map C and command line options to it. We need the IR to cover 4 cases:

  • Direct access to definition.
  • GOT access to definition.
  • Direct access to declaration.
  • GOT access to declaration.

I propose that the IR we use is:

For

@a = global i32 42
define i32* @f() { ret i32* @a }

we know @a cannot be preempted, so we produce

leaq a(%rip), %rax

For

@a = external global i32
define i32* @f() { ret i32* @a }

we don't know where @a is, so we produce

movq a@GOTPCREL(%rip), %rax

For

@a = preemptable global i32 42
define i32* @f() { ret i32* @a }

we have a new linkage type to mark a definition as preemptable. GVs with preemptable linkage must have default visibility. We produce:

movq a@GOTPCREL(%rip), %rax

For

@a = external_local global i32
define i32* @f() { ret i32* @a }

We have a new linkage for assuming a declaration is in the current dso. We produce

leaq a(%rip), %rax

For

@a = protected external_local global i32
define i32* @f() { ret i32* @a }

We produce

leaq a(%rip), %rax
.protected a

Declarations with non-default visibility are required to be external_local.

Now, let's see how the various file formats would map from C to the above:

ELF:

  • If there is a non-default visibility attribute:
    • Declarations become extern_local if the visibility is not default
    • Use the visibility
  • -fPIC. Symbols can be preempted, so use the preemptable linkage.
  • No option or -fPIE: Symbols cannot be preempted:

Maybe you covered this case else where, but we are really interested in extern symbol access with -fPIE. Whether it is really external or not at link time is the right question. It is hard to answer that at compile time and too conservative to commit to GOT access.

    • A dllimport declaration is external, others are external_local.

MachO:

  • -fPIC: Just like today.
  • static (is that supported?): declarations are external_local.

COFF:

  • A dllimport declaration is external, others are external_local.

Sorry, I know that this is a lot more work, but I am willing to help since we have to fix this representation deficiency once and for all. I think the steps would be

  • Introduce the preemptable and extern_local linkages and verifier checks for them.
  • Codegen them as defined above, regardless of -relocation-model.
  • Change clang to use the new translation.
  • Upgrade declarations with non default visibility to extern_local.
  • Verify that all non default visibility declarations are extern_local.
  • Consolidate all of the pic/pie into a single option that says if the code is position independent or not. The only difference is using "mov $foo" or "lea foo(%rip)"

majnemer added inline comments.

lib/Target/X86/X86Subtarget.cpp, line 94:

What if it is a global alias to a function?

lib/Target/X86/X86Subtarget.cpp, line 110:

Ditto.

The semantic argument against using copy relocations by default is that they hard-code the size of the variable at link time. What's the story here for that?

The semantic argument against using copy relocations by default is that they hard-code the size of the variable at link time. What's the story here for that?

I am repeating a comment I already made to keep everything in Phabricator.

I am no copy relocations expert but like it or not, copy relocations are used today by default in non-PIE mode. What kind of bugs do you see with copy relocations? This comes to my mind https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68016 but this problem exists without copy relocations too.

If it is not already clear, all I am saying is non-PIE code gen knows a way to avoid extra loads for global accesses. Let's exploit that for PIE too!

rnk edited edge metadata. (May 6 2016, 12:31 PM)

Can you explain how "preemptable" is different from "weak"?

I am assuming "preemptable" means that a preemptable symbol can be rewritten by the dynamic loader to refer to something other than the locally available definition. We should standardize on this term (or "interposable" which I like), but calling this kind of thing a "weak" symbol confuses me.


Both preemptable and extern_local sound like great ideas, but how should we tell the compiler to generate copy relocations for a declaration of a GV? That's what this patch is really about. It seems to me that copy relocation generation is really only safe if you know something about the way that the current TU is going to be linked: i.e. it's going into the executable and the linker supports copy relocations. Those seem like properties of the module to me, which is why we were going with module flags. We can invent a new linkage for this instead. Maybe call it extern_copy. When using appropriate flags, clang could apply it to all GV declarations. Do people like this better?

It also allows the user to avoid generating copy relocations against certain GVs that might change size. The fact that the size of a GV is part of its ABI in ELF was personally very surprising to me:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68016
https://github.com/google/sanitizers/issues/619
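
To illustrate why the size of a global becomes part of its ABI under copy relocations, here is a deliberately simplified C model. It assumes the executable reserves a fixed number of bytes at link time and that the startup copy cannot exceed that reservation. `LINK_TIME_SIZE`, `exe_copy` and `apply_copy_reloc` are hypothetical names, and real dynamic loaders may behave differently on a size mismatch, for example by warning.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Size of the variable as seen by the linker when the executable was built. */
enum { LINK_TIME_SIZE = 4 };

/* Space the executable reserved for its copy of the library's variable. */
unsigned char exe_copy[LINK_TIME_SIZE];

/* Model of applying a copy relocation at startup: copy the library's
   initial data into the executable's reserved space. At most
   LINK_TIME_SIZE bytes fit, no matter how large the library's
   variable is today. Returns the number of bytes copied. */
size_t apply_copy_reloc(const unsigned char *lib_data, size_t lib_size) {
    assert(lib_data != NULL);
    size_t n = lib_size < LINK_TIME_SIZE ? lib_size : LINK_TIME_SIZE;
    memcpy(exe_copy, lib_data, n);
    return n;
}
```

If a newer library version grows the variable from 4 to 8 bytes, this model copies only the first 4, so the executable works with a truncated object until it is relinked.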

joerg added a subscriber: joerg. (May 6 2016, 12:42 PM)

This looks like a huge step in the wrong direction to me. Copy relocations are one of the worst misfeatures of ELF. Please do not add any new dependencies.

That said, if the original motivation is simplifying access to variables of the main program (not in shared libraries), the linker can already optimize them.

rnk added a comment. (May 6 2016, 1:36 PM)

This looks like a huge step in the wrong direction to me. Copy relocations are one of the worst misfeatures of ELF. Please do not add any new dependencies.

I don't see any other way for us to regain the performance that copy relocations give us. If you have other ideas, we're open to them.

That said, if the original motivation is simplifying access to variables of the main program (not in shared libraries), the linker can already optimize them.

I'm assuming you mean it can turn the GOT load into an RIP-relative lea, thereby saving a load? That still creates unnecessary register pressure. Sri can say more, but I think it is actually important for some of our applications (compression) to save the lea and reduce register pressure.

joerg added a comment. (May 6 2016, 2:18 PM)
In D19995#423829, @rnk wrote:

This looks like a huge step in the wrong direction to me. Copy relocations are one of the worst misfeatures of ELF. Please do not add any new dependencies.

I don't see any other way for us to regain the performance that copy relocations give us. If you have other ideas, we're open to them.

It's essentially PROTECTED again, just that protected is kind of broken...

That said, if the original motivation is simplifying access to variables of the main program (not in shared libraries), the linker can already optimize them.

I'm assuming you mean it can turn the GOT load into an RIP-relative lea, thereby saving a load? That still creates unnecessary register pressure. Sri can say more, but I think it is actually important for some of our applications (compression) to save the lea and reduce register pressure.

Are we talking about i386 or x86_64? For the latter, it should only make a difference if you can often fold indexing into the LEA. I don't think it changes the register pressure itself at all.

Hi Rafael, IIUC, for non-pie case, LLVM is already assuming linker supports
copyreloc and already takes advantage of it to avoid generating references
via GOT. This patch just extends that for pie. It should not change
behavior regarding preemption, no?

thanks,

David

To be clear, assuming copyrelocs (support in linker) allows the compiler to
generate efficient code (speculatively) to access external globals when the
compiler has no idea where they are going to be defined. In most cases,
most of the globals end up in the main program, so copyrelocs are
only a fallback mechanism for when that is not the case.

Yes, copy relocs expose the data layout of shared lib globals -- but this
is usually not an issue for programs that are frequently built/released (on
the other hand, shared libraries are less likely to change).

Also note this is not something new introduced by this patch. Non-pie case
already does it.

David

joerg added a comment. (May 7 2016, 11:45 AM)

Also note this is not something new introduced by this patch. Non-pie case
already does it.

One reason a lot of OS people are pushing for PIE is to get rid of copy relocations. Again, please do not add another instance of them. The behavior can be obtained without any such regressions.

It seems reasonable to me to have an option enabling the use of copy relocations for globals, and I agree with David that that option should logically be applicable to PIE. I would also tend to agree with Joerg that that option should default to off, even in non-PIC modes, but it's not my call.

RE: the performance impact of relying on the linker to optimize accesses: it certainly can increase register pressure, but it shouldn't in the most common cases where the access sequence ends in putting a new value in a GPR anyway (usually either the address of the variable or something loaded from it). More importantly, I don't really understand how you can avoid this without compiling code assuming that all globals can be copy-relocated, since the *compiler* doesn't know whether an external symbol is defined as protected or default. (Unless that's exactly what we're doing — are people really this fast-and-loose in ELF land?)

RE: the semantic problems with copy relocations exposing the layout of globals: yes, binary compatibility problems usually don't exist when you recompile all the dependent code. I can understand why system maintainers wouldn't be thrilled with the idea that certain kinds of updates to system libraries can't be made without recompiling the entire user space, though.

joerg added a comment. (May 8 2016, 9:30 PM)

It seems reasonable to me to have an option enabling the use of copy relocations for globals,
and I agree with David that that option should logically be applicable to PIE. I would also tend
to agree with Joerg that that option should default to off, even in non-PIC modes, but it's not my call.

Non-PIC use of copy relocations is practically unavoidable and that's not what is under discussion.

RE: the performance impact of relying on the linker to optimize accesses: it certainly can increase
register pressure, but it shouldn't in the most common cases where the access sequence ends in
putting a new value in a GPR anyway (usually either the address of the variable or something loaded
from it).

Evidence for that? In PIC/PIE mode, obtaining the address of a variable is always either a load
(generic case via GOT) or at least an address computation (%rip + offset). Even for the second case,
I don't think full folding without scratch register is a very common case.

More importantly, I don't really understand how you can avoid this without compiling code assuming
that all globals can be copy-relocated, since the *compiler* doesn't know whether an external symbol
is defined as protected or default. (Unless that's exactly what we're doing — are people really this
fast-and-loose in ELF land?)

The (sane) default behavior for PIC and PIE code is to access external symbols with default visibility
via the GOT. I find it perfectly reasonable to request symbols to be marked as protected if they should
still be externally visible AND get optimized. It only really matters semantically when also using --export-dynamic
anyway.

RE: the semantic problems with copy relocations exposing the layout of globals: yes, binary compatibility
problems usually don't exist when you recompile all the dependent code. I can understand why system
maintainers wouldn't be thrilled with the idea that certain kinds of updates to system libraries can't be
made without recompiling the entire user space, though.

Yes, this is exactly one of the core reasons why I am adamant against supporting copy relocations for PIE.
It forces real performance degradations for library interfaces, that often have a far more measurable impact.

It seems reasonable to me to have an option enabling the use of copy relocations for globals,
and I agree with David that that option should logically be applicable to PIE. I would also tend
to agree with Joerg that that option should default to off, even in non-PIC modes, but it's not my call.

Non-PIC use of copy relocations is practically unavoidable

This is not even slightly true. There is absolutely nothing preventing you from using a GOT-style relocation in non-PIC mode. Even if you didn't have linker support for a non-relative relocation to a GOT entry (on platforms where it matters), you can easily fake one up with weak symbols.

I mean, it doesn't really affect me, so feel free to over-use copy relocations, but you don't actually have to.

RE: the performance impact of relying on the linker to optimize accesses: it certainly can increase
register pressure, but it shouldn't in the most common cases where the access sequence ends in
putting a new value in a GPR anyway (usually either the address of the variable or something loaded
from it).

Evidence for that? In PIC/PIE mode, obtaining the address of a variable is always either a load
(generic case via GOT) or at least an address computation (%rip + offset). Even for the second case,
I don't think full folding without scratch register is a very common case.

The access sequence doesn't end with constructing the address. If you're loading from the address or moving it into a register for some more complicated action, you're clobbering a register anyway and can just use that as your scratch register. That's only not true when (1) you're loading into a non-GPR, e.g. an xmm register, or (2) you're just storing to the variable. As a general rule, those cases are a lot less common, especially with an external variable.

More importantly, I don't really understand how you can avoid this without compiling code assuming
that all globals can be copy-relocated, since the *compiler* doesn't know whether an external symbol
is defined as protected or default. (Unless that's exactly what we're doing — are people really this
fast-and-loose in ELF land?)

The (sane) default behavior for PIC and PIE code is to access external symbols with default visibility
via the GOT.

I completely agree. I don't see why this should even be specific to PIC, but like I've said before, that's not really my call to make.

RE: the semantic problems with copy relocations exposing the layout of globals: yes, binary compatibility
problems usually don't exist when you recompile all the dependent code. I can understand why system
maintainers wouldn't be thrilled with the idea that certain kinds of updates to system libraries can't be
made without recompiling the entire user space, though.

Yes, this is exactly one of the core reasons why I am adamant against supporting copy relocations for PIE.
It forces real performance degradations for library interfaces, that often have a far more measurable impact.

Right.

My general take on copy relocations is that they're a very useful feature that should have been reserved for specific use cases as opposed to being generally exploited to optimize executables over libraries. (Interesting use case for copy relocations: libraries that export version-specific constant values that are frequently accessed by clients. For example, you could imagine libc exporting the system page size.)

I am assuming "preemptable" means that a preemptable symbol can be rewritten by the dynamic loader to refer to something other than the locally available definition. We should standardize on this term (or "interposable" which I like), but calling this kind of thing a "weak" symbol confuses me.

I like interposable too. Thanks.


Both preemptable and extern_local sound like great ideas, but how should we tell the compiler to generate copy relocations for a declaration of a GV?

That is extern_local. That tells llc to assume the symbol is local.
Note from the example that it produces "leaq a(%rip), %rax", which will
create a copy relocation if the assumption is wrong.

It seems to me that copy relocation generation is really only safe if you know something about the way that the current TU is going to be linked: i.e. it's going into the executable and the linker supports copy relocations.

and the DSO that defines it is ready for it being preempted.

But note that a copy relocation is one way of handling an extern_local.
The other one is just resolving the relocation if the definition
ends up really being in the current DSO.

Those seem like properties of the module to me, which is why we were going with module flags. We can invent a new linkage for this instead. Maybe call it extern_copy. When using appropriate flags, clang could apply it to all GV declarations. Do people like this better?

See above why I don't think we need a copy linkage.

It also allows the user to avoid generating copy relocations against certain GVs that might change size. The fact that the size of a GV is part of its ABI in ELF was personally very surprising to me:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68016
https://github.com/google/sanitizers/issues/619

Yes. That is why I suggested having dllimport control this on a decl
by decl basis. By default you get extern_local which can result in a
copy relocation. If you don't want that for 'foo', you can get an
extern by marking it dllimport.

Cheers,
Rafael

I will try coding the interposable/extern_local patch tomorrow.
Hopefully it should clarify this discussion a bit.

Cheers,
Rafael

emaste added a subscriber: emaste. (Jan 24 2017, 8:57 AM)