Page MenuHomePhabricator

[LLD][ELF] Add the -z ifunc-noplt option
ClosedPublic

Authored by markj on May 6 2019, 2:49 PM.

Details

Summary

When the option is configured, ifunc calls do not go through the PLT;
rather, they appear as regular function calls with relocations
referencing the ifunc symbol, and the resolver is invoked when
applying the relocation. This is intended for use in freestanding
environments where text relocations are permissible and is incompatible
with the -z text option. The option is motivated by ifunc usage in the
FreeBSD kernel, where ifuncs are used to elide CPU feature flag bit
checks in hot paths. Instead of replacing the cost of a branch with that
of an indirect function call, the -z ifunc-noplt option is used to ensure
that ifunc calls carry no hidden overhead relative to normal function
calls.

Test Plan:
I added a couple of regression tests and tested the FreeBSD kernel
build using the latest lld sources.

To demonstrate the effects of the change, I used a micro-benchmark
which results in frequent invocations of a FreeBSD kernel ifunc. The
benchmark was run with and without IBRS enabled, and with and without
-zifunc-noplt configured. The observed speedup is small and consistent,
and is significantly larger with IBRS enabled:

https://people.freebsd.org/~markj/ifunc-noplt/noibrs.txt
https://people.freebsd.org/~markj/ifunc-noplt/ibrs.txt

Diff Detail

Repository
rL LLVM

Event Timeline

markj created this revision.May 6 2019, 2:49 PM
markj edited the summary of this revision. (Show Details)May 6 2019, 2:50 PM
markj added reviewers: emaste, arichardson, ruiu, MaskRay.
ruiu added a comment.May 6 2019, 8:24 PM

Generally looking good.

lld/ELF/Driver.cpp
891 ↗(On Diff #198344)

According to the convention, this should be ZIfuncNoplt.

lld/docs/ld.lld.1
552–553 ↗(On Diff #198344)

IIRC, text relocations are not preferred but actually permitted in most environment, no? I'd mention that this option is intended to be used for the kernel where an existence of text relocation doesn't matter but the cost of PLT matters.

grimar added a subscriber: grimar.May 7 2019, 2:41 AM

Do you want to add a DT_TEXTREL tag?

// SyntheticSections.cpp
  if (!Config->ZText)
    addInt(DT_TEXTREL, 0);
MaskRay added inline comments.May 7 2019, 3:34 AM
lld/test/ELF/gnu-ifunc-noplt-i386.s
9 ↗(On Diff #198344)

Where is the IRELATIVE relocation? And I don't think the ordering matters.

43 ↗(On Diff #198344)

Not sure if you need bar2@plt zed2@plt. The normal PLT entries are checked by numerous tests.

lld/test/ELF/gnu-ifunc-noplt.s
9 ↗(On Diff #198344)

Same question goes here.

MaskRay added inline comments.May 7 2019, 6:06 AM
lld/test/ELF/gnu-ifunc-noplt.s
38 ↗(On Diff #198344)

You FreeBSD llvm-objdump is old XD

The misindentation made me unhappy so I fixed it in rL358405.

emaste added inline comments.May 7 2019, 7:16 AM
lld/docs/ld.lld.1
552–553 ↗(On Diff #198344)

I think that's fair - by default the linker won't permit the creation of .text relocations but they are permitted by the runtime environment.

markj updated this revision to Diff 198523.May 7 2019, 1:45 PM
markj marked 9 inline comments as done.

Address feedback:

  • Set DT_TEXTREL
  • Rename the config flag to match Capitalization Conventions
  • Clarify some text in the man page update
  • Fix test comments and indentation
markj added inline comments.May 7 2019, 1:46 PM
lld/test/ELF/gnu-ifunc-noplt-i386.s
9 ↗(On Diff #198344)

My mistake, sorry. I left a few TODO comments and forgot to fix them before creating the review. There should not be any IRELATIVE relocations.

43 ↗(On Diff #198344)

I just reused them from gnu-ifunc-plt.s. If you prefer I can remove them, though I need to add -pie or so to ensure that lld doesn't try to resolve the relocations statically.

lld/test/ELF/gnu-ifunc-noplt.s
38 ↗(On Diff #198344)

Not too old :)

> llvm-objdump --version
LLVM (http://llvm.org/):
  LLVM version 8.0.0
ruiu accepted this revision.May 7 2019, 7:12 PM

LGTM

This revision is now accepted and ready to land.May 7 2019, 7:12 PM
MaskRay added inline comments.May 7 2019, 7:54 PM
lld/test/ELF/gnu-ifunc-noplt.s
38 ↗(On Diff #198344)

My change will be included by llvm 9.0.0 ;-) Since you've cloned the monorepo, you may run ninja -C your/build llvm-objdump and use that llvm-objdump to update the tests.

I still suggest fixing the misaligned instructions before you commit...

MaskRay requested changes to this revision.May 7 2019, 8:12 PM

FreeBSD makes use of the option when linking x86 kernels, which now make heavy use of ifuncs in hot paths. Without the option, each ifunc results in the generation of a PLT entry, so calls needlessly incur the extra overhead of an indirect jump. With the option, ifunc calls have the same cost as ordinary function calls.

Oh I recall this one and find your efforts to push forward this.

I remember I asked if you got numbers of the improvement and you didn't have an answer then. I still hope you can share the benchmark to justify this is a useful feature. According to my understanding, -z ifunc-noplt will introduce many text relocations (a kind of dynamic relocations), that will slow down the dynamic loader. The overhead resolving the text relocation pays off after the function has been called a few number of times. Surely you have hot paths, but many others are cold. The problem of -z ifunc-noplt is that it is too large a hammer - it applies to the whole module.

If you extend the usage to userspace applications you may have to tune the runtime code of static linked programs.

If you want a fast memcpy, I wonder why you can't define a global function pointer: void (*fast_memcpy)(void*,const void*,size_t) __attribute__((visibility("hidden"))) = slow_memcpy; and initializes it to a fast implementation at runtime.

Compared with the usual ifunc: GOT+PLT, this approach is similar to -fno-plt. I'd like to know why you think it doesn't meet your needs and you really need -z ifunc-noplt.

This revision now requires changes to proceed.May 7 2019, 8:12 PM

There is another issue: this feature is arch-specific and probably only applies on x86:

// ELF/Relocations.cpp#L880
    } else if (RelType Rel = Target->getDynRel(Type)) {
      In.RelaDyn->addReloc(Rel, &Sec, Offset, &Sym, Addend, R_ADDEND, Type);

On non-x86, you probably can't find an appropriate dynamic relocation type that resolves your desired text relocations. This looks to me a really hacky feature (linker support + textrel special case + more dynamic relocation type handling) that is limited to x86. Its benefit is also unclear (I can't find benchmarks in several of your FreeBSD patches). On the other hand, the alternative (a global function pointer) seems equally good.

markj added a comment.May 8 2019, 8:48 AM

FreeBSD makes use of the option when linking x86 kernels, which now make heavy use of ifuncs in hot paths. Without the option, each ifunc results in the generation of a PLT entry, so calls needlessly incur the extra overhead of an indirect jump. With the option, ifunc calls have the same cost as ordinary function calls.

Oh I recall this one and find your efforts to push forward this.

I remember I asked if you got numbers of the improvement and you didn't have an answer then. I still hope you can share the benchmark to justify this is a useful feature.

See below.

According to my understanding, -z ifunc-noplt will introduce many text relocations (a kind of dynamic relocations), that will slow down the dynamic loader. The overhead resolving the text relocation pays off after the function has been called a few number of times. Surely you have hot paths, but many others are cold. The problem of -z ifunc-noplt is that it is too large a hammer - it applies to the whole module.

In our use-case the relocations are applied at boot time. My kernel has 3815 text relocations, all for ifuncs, which take 418672 cycles to resolve as measured by rdtsc on a Xeon E5-2630v3. The overhead is negligible and I expect it will stay that way.

I agree that -zifunc-noplt is not suitable for userspace applications, which must decide between the overhead of PLT calls and text relocations. In the kernel or other freestanding environments, text relocations do not have the same downsides, so the tradeoff is different.

If you extend the usage to userspace applications you may have to tune the runtime code of static linked programs.

If you want a fast memcpy, I wonder why you can't define a global function pointer: void (*fast_memcpy)(void*,const void*,size_t) __attribute__((visibility("hidden"))) = slow_memcpy; and initializes it to a fast implementation at runtime.

This still requires an indirect branch and has the added disadvantage that control flow is directed by pointers residing in writable memory.

Compared with the usual ifunc: GOT+PLT, this approach is similar to -fno-plt. I'd like to know why you think it doesn't meet your needs and you really need -z ifunc-noplt.

A function pointer still has extra, unnecessary overhead. We use ifuncs as a micro-optimization to avoid having CPU feature bit tests in commonly used functions. For instance, our copyout() implementation contains special logic for processors supporting SMAP. The overhead of testing the CPU feature bit is not high, but it is measurable in micro-benchmarks, and the test consumes CPU resources (extra I-cache footprint, branch predictor cache, etc.) to compute a result that is known at boot-time. Over time, as vendors introduce new CPU features, our code accrues more and more of these tests. ifuncs are an attractive mitigation for this problem, but if they use the PLT than we are still sacrificing CPU resources needlessly: the PLT occupies space in the instruction cache and the indirect call targets occupy space in a cache. -zifunc-noplt completely removes these overheads. I understand that they are small, but they accumulate as the code base grows and introduce cognitive overhead as programmers must think about the hidden cost behind what looks like an ordinary C function call.

There is another issue: this feature is arch-specific and probably only applies on x86:

// ELF/Relocations.cpp#L880
    } else if (RelType Rel = Target->getDynRel(Type)) {
      In.RelaDyn->addReloc(Rel, &Sec, Offset, &Sym, Addend, R_ADDEND, Type);

On non-x86, you probably can't find an appropriate dynamic relocation type that resolves your desired text relocations. This looks to me a really hacky feature (linker support + textrel special case + more dynamic relocation type handling) that is limited to x86. Its benefit is also unclear (I can't find benchmarks in several of your FreeBSD patches). On the other hand, the alternative (a global function pointer) seems equally good.

I tested the feature on arm64 and it works there as well. You are right that it requires support for static relocations in the kernel linker, but I do not see this to be a problem. We already implement a small "static" linker in the kernel (sys/kern/link_elf_obj.c) to handle the fact that loadable kernel modules are relocatable .o files instead of shared libs on some platforms (and the reason for this is again to avoid the PLT). IMO it is just another hint that the feature is not suitable for vanilla userspace.

To be clear, I do agree that this is hacky. It can not be used unless the application has some control over how it is linked; I tried to indicate this in the man page description. But, the implementation is extremely simple, and I believe the goal is conceptually reasonable. In the future, if the feature becomes too much work to maintain, it can be removed without affecting application correctness. In the meantime I intend to be available to support the patch if needed and refine it if possible.

Regarding benchmarks, I do not believe the patch will have an impact on any interesting macro-benchmark and did not attempt to measure. For the micro-benchmark, I used a program whose loop copies data out of the kernel at a high rate. In particular, it calls the copyout() ifunc frequently. I retried the test with the latest LLD sources. The code, data and summaries are here: https://people.freebsd.org/~markj/ifunc/

I tested with and without IBRS enabled, mostly to demonstrate that the optimization has an impact on microarchitectural resource usage. In particular, with IBRS enabled the improvement from -zifunc-noplt is more pronounced. The summaries show system CPU time consumed by the test program. "Patched" means -zifunc-noplt is configured.

https://people.freebsd.org/~markj/ifunc/ibrs.txt
https://people.freebsd.org/~markj/ifunc/noibrs.txt

The improvement is small and consistent.

MaskRay accepted this revision.May 8 2019, 8:07 PM

Thanks for your explanation. I wanted to make sure the caveats of -z ifunc-noplt have been considered, the alternatives are compared before you decide to push forward these efforts :)

Over time, as vendors introduce new CPU features, our code accrues more and more of these tests. ifuncs are an attractive mitigation for this problem,

I will be glad if this helps some code organization problems! I did worry -z ifunc-noplt was proposed as it was a "fancy" feature. I asked musl guys, their feeling: "it's even worse on linux where they do this kind of hack; then happily make 10 abstraction layers; i don't get where folks get the idea to pour effort into wacky, costly microoptimizations like this when they have HUGE high-level performance problems". But they may not be aware of the FreeBSD situation.

Since you've proved this has been well-considered, I shall not block you from moving forward :)

I tested the feature on arm64 and it works there as well.

We already implement a small "static" linker in the kernel (sys/kern/link_elf_obj.c) to handle the fact that loadable kernel modules are relocatable .o files instead of shared libs on some platforms (and the reason for this is again to avoid the PLT).

Happy to know this is not x86 only. This indeeds requires the dynamic loader support of some "static" relocation types.

In the description,

Test Plan: I added a couple of regression tests and tested the FreeBSD kernel build using the latest lld sources.

If there are useful links, you may change this part of the commit description to include them.

lld/test/ELF/gnu-ifunc-noplt-i386.s
2 ↗(On Diff #198523)

The -pc-linux part in i686-pc-linux does not matter. You don't have to call it Linux. Something like -triple=i686-pc-freebsd13.0 should also work. There are a few other tests using the freebsd triple, not many currently, but I hope there will be more in the future for other OSes or os-neutral -triple=x86_64.

% rg -l freebsd | wc -l
49
% rg -l linux | wc -l  
1346
This revision is now accepted and ready to land.May 8 2019, 8:07 PM
MaskRay added inline comments.May 8 2019, 9:14 PM
lld/ELF/SyntheticSections.cpp
1238 ↗(On Diff #198523)

Oh, this should probably be restored.

In http://lists.llvm.org/pipermail/llvm-dev/2018-August/125490.html , I think your opinion was that -z text -z ifunc-noplt = error.

ruiu added a comment.May 9 2019, 4:18 AM

Mark,

Thank you for the explanation. I tend to oppose to add micro-optimizations to the ABI, but I agree that this particular optimization seems reasonable. I can imagine that with this feature you can use ifuncs in the kernel more liberally, which would make more difference in some test case.

markj updated this revision to Diff 199198.May 12 2019, 8:45 PM
markj marked 4 inline comments as done.

Address review feedback:

  • Prohibit the combination of -ztext and -zifunc-noplt, add a regression test.
  • Use a non-Linux OS in the triple for ifunc-noplt tests.
markj added a comment.May 12 2019, 8:46 PM

Thanks for your explanation. I wanted to make sure the caveats of -z ifunc-noplt have been considered, the alternatives are compared before you decide to push forward these efforts :)

Certainly, I should have provided more justification in the review description. I appreciate the skepticism. :)

Over time, as vendors introduce new CPU features, our code accrues more and more of these tests. ifuncs are an attractive mitigation for this problem,

I will be glad if this helps some code organization problems! I did worry -z ifunc-noplt was proposed as it was a "fancy" feature. I asked musl guys, their feeling: "it's even worse on linux where they do this kind of hack; then happily make 10 abstraction layers; i don't get where folks get the idea to pour effort into wacky, costly microoptimizations like this when they have HUGE high-level performance problems". But they may not be aware of the FreeBSD situation.

I tend to agree with this view. FreeBSD is certainly not optimized to the point where a change like this is crucial. My two counter-points here would be:

  • the patch is very small, and
  • the patch is an optional optimization, i.e., if you decide to remove it in the future, FreeBSD will not lose any functionality.

If either of these statements were false, I would be much more reticent about trying to push the change upstream.

[...]

In the description,

Test Plan: I added a couple of regression tests and tested the FreeBSD kernel build using the latest lld sources.

If there are useful links, you may change this part of the commit description to include them.

Thanks for the feedback. I will update the review description to provide a better commit message.

Thank you for the explanation. I tend to oppose to add micro-optimizations to the ABI, but I agree that this particular optimization seems reasonable. I can imagine that with this feature you can use ifuncs in the kernel more liberally, which would make more difference in some test case.

Right, my concern is with ifunc usage over time. The kernel started using ifuncs late last year and now we are up to 51 ifuncs, and I'm sure the number will keep growing.

lld/ELF/SyntheticSections.cpp
1238 ↗(On Diff #198523)

Ah, right, thank you. I added a check to readConfigs() instead and added a test to basic.s. Hopefully that's appropriate.

lld/test/ELF/gnu-ifunc-noplt.s
38 ↗(On Diff #198344)

Ok, but I believe I fixed the misalignment before your last comment here? I can't see any others.

markj edited the summary of this revision. (Show Details)May 12 2019, 9:19 PM
ruiu added inline comments.May 12 2019, 9:30 PM
lld/ELF/Driver.cpp
936 ↗(On Diff #199198)

Can you move this to checkOptions?

937 ↗(On Diff #199198)

remove "the"

markj updated this revision to Diff 199203.May 12 2019, 9:33 PM
markj marked 2 inline comments as done.

Address feedback:

  • Move check to checkOptions().
  • Remove "the".
ruiu added a comment.May 12 2019, 10:21 PM

I don't have any further comments. Fangrui, are you OK with this change?

lld/ELF/Driver.cpp
313 ↗(On Diff #199203)

Let's use the same wording as other error messages: "-z text and -z ifunc-noplt may not be used together"

markj updated this revision to Diff 199205.May 12 2019, 10:38 PM
markj marked an inline comment as done.

Address feedback:

  • Use consistent wording in the error message.
MaskRay accepted this revision.May 12 2019, 10:53 PM

LGTM

lld/test/ELF/gnu-ifunc-noplt.s
24 ↗(On Diff #199205)

// DISASM-NEXT: 201000: c3 retq

I believe you are still using an older llvm-objdump
After rL358474, you shall see

0000000000001001 bar:
    1001: c3                            retq

0000000000001002 _start:
    1002: e8 29 00 00 00                callq   41 <foo@plt>                                                   
    1007: e8 34 00 00 00                callq   52 <bar@plt>
    100c: e8 3f 00 00 00                callq   63 <bar2@plt>
    1011: e8 4a 00 00 00                callq   74 <zed2@plt>

A better test should probably use llvm-objdump --no-show-raw-insn as the hex bytes here are not useful.

MaskRay added inline comments.May 12 2019, 10:57 PM
lld/test/ELF/gnu-ifunc-noplt.s
4 ↗(On Diff #199205)

x86_64-unknown-freebsd and x86_64-pc-freebsd do the same thing but it might be worth keeping consistency in the same file :)

markj updated this revision to Diff 199269.May 13 2019, 8:56 AM
markj marked 4 inline comments as done.

Address feedback:

  • Use --no-show-raw-insn.
  • Use a consistent triple.
lld/test/ELF/gnu-ifunc-noplt.s
4 ↗(On Diff #199205)

Sigh, thanks.

24 ↗(On Diff #199205)

I see, sorry. I reformatted all of the output using an updated llvm-objdump and added --no-show-raw-insn.

markj updated this revision to Diff 199452.May 14 2019, 8:16 AM

Remove a command line that snuck in by accident.

This revision was automatically updated to reflect the committed changes.

Talked with Mark in an IRC channel and will commit on behalf of him :)