
Support Intel Control-flow Enforcement Technology
Accepted · Public

Authored by ruiu on Mar 25 2019, 10:19 AM.

Details

Summary

This patch is based on https://reviews.llvm.org/D58102.

This patch adds Intel CET (Control-flow Enforcement Technology) support to
lld. The implementation follows the draft version of psABI which you can
download from https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI.

I added a comment about what Intel CET is, so I won't repeat it here, but in
summary, CET introduces a new restriction on indirect jump instructions so
that you can limit the places to which indirect jumps can transfer control.

In order to use the feature, you need to compile source files with
-fcf-protection. CET is enabled only when all input files are compiled with
the flag. If you want to make sure that the feature is enabled, pass
-require-cet to lld.

CET-enabled executable files have two PLT sections, ".plt" and ".plt.sec".
For the details as to why we have two sections, please read this patch's
comment.

Event Timeline

There are a very large number of changes, so older changes are hidden.
ruiu updated this revision to Diff 192744.Mar 28 2019, 4:56 PM
  • Generally simplified the code.
  • Separated CET-specific code from the regular code.
  • Fixed bugs and added tests for .got.plt, .plt and .splt.
lld/ELF/Arch/X86.cpp
428

It is not useful. For code size, is it better to write nothing here, or not use it at all?

MaskRay added inline comments.Mar 28 2019, 10:52 PM
lld/ELF/Arch/X86.cpp
412

brend32 -> endbr32

414

template <class ELFT> class IntelCET : public X86_64<ELFT> {

IBT-enabled PLT can be used with the ILP32 programming model (x86-32 ABI). (If you don't like putting too much stuff in the initial commit, class IntelCET : public X86_64<ELF64LE> { is fine.)

lld/ELF/Arch/X86_64.cpp
571

brend32 -> endbr64

588

If I understand it correctly, IntelCET::writePlt writes the contents of ld.bfd's .plt.sec, then it doesn't need a PLT header. Can Symbol::getPltVA (Plt->getVA() + Plt->HeaderSize + PltIndex * Target->PltEntrySize;) compute a wrong address because it thinks HeaderSize is actually 0?

Disassembly of section .plt:  // counterpart in lld: IntelCET::writeIBTPlt

0000000000001020 <.plt>:
    1020:       ff 35 e2 2f 00 00       pushq  0x2fe2(%rip)        # 4008 <_GLOBAL_OFFSET_TABLE_+0x8>
    1026:       f2 ff 25 e3 2f 00 00    bnd jmpq *0x2fe3(%rip)        # 4010 <_GLOBAL_OFFSET_TABLE_+0x10>
    102d:       0f 1f 00                nopl   (%rax)
    1030:       f3 0f 1e fa             endbr64 
    1034:       68 00 00 00 00          pushq  $0x0
    1039:       f2 e9 e1 ff ff ff       bnd jmpq 1020 <.plt>
    103f:       90                      nop
    1040:       f3 0f 1e fa             endbr64 
    1044:       68 01 00 00 00          pushq  $0x1
    1049:       f2 e9 d1 ff ff ff       bnd jmpq 1020 <.plt>
    104f:       90                      nop
...
Disassembly of section .plt.sec:  // counterpart in lld: IntelCET::writePlt

0000000000001060 <putchar@plt>:
    1060:       f3 0f 1e fa             endbr64 
    1064:       f2 ff 25 ad 2f 00 00    bnd jmp QWORD PTR [rip+0x2fad]        # 4018 <putchar@GLIBC_2.2.5>
    106b:       0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]

0000000000001070 <puts@plt>:
    1070:       f3 0f 1e fa             endbr64 
    1074:       f2 ff 25 a5 2f 00 00    bnd jmp QWORD PTR [rip+0x2fa5]        # 4020 <puts@GLIBC_2.2.5>
    107b:       0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
lld/ELF/InputFiles.cpp
594 ↗(On Diff #192744)

call flow control -> Control-Flow Enforcement ?

608 ↗(On Diff #192744)

using Elf_Nhdr = ... is more common nowadays.

(Besides, at some point, we need some IPLT + --require-cet tests.)

lld/ELF/Arch/X86.cpp
429

In ld.bfd's implementation, I don't see 16 bytes of NOPs. I think in the ABI, .SPLT1 is the counterpart of .PLT1; there is a .PLT0 (PLT header), but .SPLT0 doesn't exist.

Also, ld.bfd uses .plt.sec, not .splt. @hjl.tools Which is the preferred name for the second PLT section?

ruiu marked 10 inline comments as done.Mar 29 2019, 10:20 AM
ruiu added inline comments.
lld/ELF/Arch/X86.cpp
414

Actually the template is gone; I removed it because no code depended on ELFT. (Also, I guess no one really uses the x32 ABI? I think I read an article saying that Linux kernel developers are discussing deprecating it.)

lld/ELF/InputFiles.cpp
594 ↗(On Diff #192744)

This is a general name and not specific to Intel's implementation.

608 ↗(On Diff #192744)

Yeah, but we should convert all of them in lld at once.

ruiu updated this revision to Diff 192855.Mar 29 2019, 10:20 AM
ruiu marked 2 inline comments as done.
  • removed SPLT0 as it contained just two NOPs
  • added tests for --require-cet
ruiu updated this revision to Diff 193034.Mar 31 2019, 6:00 PM
  • rebase
MaskRay added inline comments.Mar 31 2019, 8:10 PM
lld/ELF/Arch/X86.cpp
414

Sorry I didn't notice that X86_64 had been de-templated.

lld/ELF/InputFiles.cpp
594 ↗(On Diff #192744)

My concern here is that llvm-objdump/GNU objdump and other tools/profilers would have to recognize two section names, .plt.sec and .splt. Many tools use the section name to do special things, e.g.:
@hjl.tools If the ABI can be changed so that it does not deliberately differ from the implementation (or the other way round), that would be great.

binutils/objdump.c:739

static bfd_boolean
is_significant_symbol_name (const char * name)
{
  return strncmp (name, ".plt", 4) == 0 || strcmp (name, ".got") == 0;
}
ruiu marked an inline comment as done.Mar 31 2019, 8:27 PM
ruiu added inline comments.
lld/ELF/InputFiles.cpp
594 ↗(On Diff #192744)

I'm not sure if I follow. Currently, in my patch, I use ".splt" and ".plt" as PLT section names, which I think are defined by the ABI. Are you saying that ".plt.sec" might be the correct name?

xiangzhangllvm added inline comments.Mar 31 2019, 8:35 PM
lld/ELF/InputFiles.cpp
594 ↗(On Diff #192744)

The spec does not fix the PLT's name; its full name is the second IBT-enabled PLT.
GCC names it .plt.sec; I named it .splt for short at first.
It may be better to use the same name as GCC.

ruiu marked an inline comment as done.Mar 31 2019, 9:36 PM
ruiu added inline comments.
lld/ELF/InputFiles.cpp
594 ↗(On Diff #192744)

OK, then I'll just follow the others. Is there anything else in this patch for which you invented something nonstandard?

MaskRay added inline comments.Mar 31 2019, 10:12 PM
lld/ELF/InputFiles.cpp
594 ↗(On Diff #192744)

I'm not sure if I follow. Currently, in my patch, I use ".splt" and ".plt" as PLT section names, which I think are defined by the ABI. Are you saying that ".plt.sec" might be the correct name?

Sorry for not being clear. PLT sections do not have a dedicated section type and in practice they are usually recognized by the name .plt. My example above gave one case. If ld.bfd uses .plt.sec while we use .splt, the tools (objdump/llvm-mc/profiler/...) would have to understand the two names. The complexity could have been avoided if the linkers and the ABI agreed on the same name, say, .plt.sec.

I prefer .plt.sec to .splt because the same-prefix convention is already used in several other places to have fine-grained section types.

Thus, I suggest, if there is no strong reason to keep using .splt in the ABI, change it to what the implementations do: .plt.sec.

ruiu updated this revision to Diff 193044.Mar 31 2019, 10:49 PM
  • renamed .splt to .plt.sec

writePltHeader should also be changed.

diff --git i/ELF/Arch/X86_64.cpp w/ELF/Arch/X86_64.cpp
index 668b40863..75f780f6b 100644
--- i/ELF/Arch/X86_64.cpp
+++ w/ELF/Arch/X86_64.cpp
@@ -144,7 +144,7 @@ void X86_64::writePltHeader(uint8_t *Buf) const {
   };
   memcpy(Buf, PltData, sizeof(PltData));
   uint64_t GotPlt = In.GotPlt->getVA();
-  uint64_t Plt = In.Plt->getVA();
+  uint64_t Plt = In.IBTPlt ? In.IBTPlt->getVA() : In.Plt->getVA();
   write32le(Buf + 2, GotPlt - Plt + 2); // GOTPLT+8
   write32le(Buf + 8, GotPlt - Plt + 4); // GOTPLT+16
 }

With these changes, I managed to run hello-world with clang {,-m32} -fuse-ld=lld -fcf-protection=branch a.c -Wl,--require-cet -o a; ./a

lld/ELF/Arch/X86_64.cpp
617

write32le(Buf + 5, I);

For glibc lazy binding, x86_64 uses relocation indices while x86_32 uses offsets.

lld/ELF/Writer.cpp
524

Since In.IBTPlt is named .plt while In.Plt is named .plt.sec, creating In.IBTPlt before In.Plt will give us the ordering .plt .plt.sec (instead of .plt.sec .plt), which may look more natural.

No strong preference here. Up to you.

ruiu updated this revision to Diff 193048.Apr 1 2019, 12:53 AM
  • review comments
ruiu added a comment.Apr 1 2019, 12:53 AM

writePltHeader should also be changed.

diff --git i/ELF/Arch/X86_64.cpp w/ELF/Arch/X86_64.cpp
index 668b40863..75f780f6b 100644
--- i/ELF/Arch/X86_64.cpp
+++ w/ELF/Arch/X86_64.cpp
@@ -144,7 +144,7 @@ void X86_64::writePltHeader(uint8_t *Buf) const {
   };
   memcpy(Buf, PltData, sizeof(PltData));
   uint64_t GotPlt = In.GotPlt->getVA();
-  uint64_t Plt = In.Plt->getVA();
+  uint64_t Plt = In.IBTPlt ? In.IBTPlt->getVA() : In.Plt->getVA();
   write32le(Buf + 2, GotPlt - Plt + 2); // GOTPLT+8
   write32le(Buf + 8, GotPlt - Plt + 4); // GOTPLT+16
 }

With these changes, I managed to run hello-world with clang {,-m32} -fuse-ld=lld -fcf-protection=branch a.c -Wl,--require-cet -o a; ./a

Great! Thank you for verifying!

MaskRay accepted this revision.Apr 1 2019, 1:04 AM

".splt" is now In.Plt. Changes to Relocations.cpp are reverted as we no longer need them.

Is --force-cet the best option name? H. J. Lu pointed out that gold has -z ibtplt and other options, but I couldn't find these options in the binutils repository. Is the option really implemented in gold?

The description needs an update: .splt -> .plt.sec, gold -> ld.bfd.

ibtplt is implemented for ld.bfd but not for gold.

This revision is now accepted and ready to land.Apr 1 2019, 1:04 AM
xiangzhangllvm added inline comments.Apr 1 2019, 1:08 AM
lld/ELF/InputFiles.cpp
594 ↗(On Diff #192744)

Another difference from GNU ld is that GNU ld can handle the case of multiple .note.gnu.property sections, for example:

file.s :
 .section ".note.gnu.property", "a"
....
 .section ".note.gnu.property", "a"
....

In an earlier version of my patch I supported this, but in the end, to simplify the patch, I deleted that code,
because I thought the compiler and assembler should merge them into one section (usually with multiple descriptors) one day.
I am not sure whether that thought is right, but I figured you should know.

MaskRay added inline comments.Apr 1 2019, 1:26 AM
lld/ELF/Arch/X86.cpp
414

IBT (or IBTPlt) may be a better name than IntelCET as this class has nothing to do with the other part of CET: shstk.

lld/test/ELF/Inputs/i386-cet1.s
2 ↗(On Diff #193048)

Missing .align 4

lld/test/ELF/Inputs/i386-cet2.s
1 ↗(On Diff #193048)

Missing .align 8

lld/test/ELF/Inputs/i386-cet4.s
1 ↗(On Diff #193048)

In one of the tests, it may be good to check --require-cet works as intended if .note.gnu.property contains several notes.

ld -r may merge .note.gnu.property of several input objects into one.

ruiu edited the summary of this revision. (Show Details)Apr 1 2019, 1:27 AM
MaskRay added inline comments.Apr 1 2019, 1:27 AM
lld/test/ELF/Inputs/i386-cet2.s
1 ↗(On Diff #193048)

Sorry, x86-64-cet.s needs .align 8, not i386-cet*.s.

ruiu marked an inline comment as done.Apr 1 2019, 1:33 AM
ruiu added inline comments.
lld/test/ELF/Inputs/i386-cet1.s
2 ↗(On Diff #193048)

Do you need that? It's not wrong to add it, but for the purpose of this test, isn't the file enough as is?

MaskRay added inline comments.Apr 1 2019, 1:47 AM
lld/test/ELF/Inputs/i386-cet1.s
2 ↗(On Diff #193048)

Not required but that would make the tests feel more real.

xiangzhangllvm added inline comments.Apr 1 2019, 4:21 AM
lld/ELF/SyntheticSections.cpp
314

This may waste 4 bytes of code size on 32-bit machines.

ruiu marked an inline comment as done.Apr 1 2019, 9:44 PM
ruiu added inline comments.
lld/ELF/SyntheticSections.cpp
314

Wasting 4 bytes is, I think, completely acceptable, but this looks like a correctness issue. I'll fix it.

ruiu updated this revision to Diff 193233.Apr 1 2019, 10:27 PM
ruiu edited the summary of this revision. (Show Details)
  • address review comments
ruiu marked an inline comment as done.Apr 1 2019, 10:27 PM

I noticed that the latest GNU objdump can symbolize ld.bfd-emitted .plt/.plt.sec entries but not those emitted by lld. The gdb disassembly outputs exhibit the same difference.
I guess it is probably because lld doesn't emit the bnd prefix (0xf2) while ld.bfd does (BTW, I cannot turn it off: there is no -z nobndplt), and the relevant support isn't currently available in objdump.
If that is the case, it gives me the impression that the ABI can still be changed, as no tooling has caught up yet.

Since we don't emit the bnd prefix (0xf2) for MPX (MPX support has been dropped by GCC 9), we can merge .plt and .plt.sec entries as follows:

4 endbr64
5 jmpq *xxx(%rip) ; jump to the next endbr64 for lazy binding
4 endbr64
5 pushq           ; relocaton index
5 jmpq *xxx(%rip) ; jump to .plt

This PLT entry takes 4+5+4+5+5 = 23 bytes. If we aim for 8-byte alignment, we get a 24-byte PLT entry size, which improves on the current 32 bytes.
(However, if the bnd prefix is prepended to the jmpq instructions, the PLT entry takes 25 bytes and won't fit in 24 bytes. I feel people don't care much about bnd now, so don't let it pessimize the common case.)

I have another question for @xiangzhangllvm: have you considered optimizing the PLT entry size for -z now? We could have a PLT entry as simple as:

4 endbr64
5 jmpq *xxx(%rip)

Of course, if we use -z now, there may be little reason not to just use -fno-plt when compiling the sources.

Hi MaskRay, I tested your idea by changing the patch:

diff --git a/ELF/Arch/X86_64.cpp b/ELF/Arch/X86_64.cpp
index 7d7cc00..14fb6b2 100644
--- a/ELF/Arch/X86_64.cpp
+++ b/ELF/Arch/X86_64.cpp
@@ -553,11 +553,11 @@ void IntelCET::writePlt(uint8_t *Buf, uint64_t GotPltEntryAddr,
                         unsigned RelOff) const {
   const uint8_t Inst[] = {
       0xf3, 0x0f, 0x1e, 0xfa,       // endbr64
-      0xff, 0x25, 0,    0,    0, 0, // jmpq *got(%rip)
-      0x66, 0x0f, 0x1f, 0x44, 0, 0, // nop
+      0xf2, 0xff, 0x25, 0,    0,    0, 0, // jmpq *got(%rip)
+      0x0f, 0x1f, 0x44, 0, 0, // nop
   };
   memcpy(Buf, Inst, sizeof(Inst));
-  write32le(Buf + 6, GotPltEntryAddr - PltEntryAddr - 10);
+  write32le(Buf + 7, GotPltEntryAddr - PltEntryAddr - 11);
 }

After that, GNU objdump can symbolize .plt.sec; it really is the BND prefix that matters.

[xiangzh1@shliclel100 /export/users.shared/xiangzh1/LLVM_ORG/test]$ld.lld -e func1 t.o t1.so -o tout-new
[xiangzh1@shliclel100 /export/users.shared/xiangzh1/LLVM_ORG/test]$objdump -d tout-new

tout-new:     file format elf64-x86-64


Disassembly of section .text:

0000000000201000 <func1>:
  201000:       e8 3b 00 00 00          callq  201040 <func2@plt>
  201005:       e8 46 00 00 00          callq  201050 <func3@plt>
  20100a:       c3                      retq

Disassembly of section .plt:

0000000000201010 <.plt>:
  201010:       ff 35 f2 1f 00 00       pushq  0x1ff2(%rip)        # 203008 <_DYNAMIC+0x1008>
  201016:       ff 25 f4 1f 00 00       jmpq   *0x1ff4(%rip)        # 203010 <_DYNAMIC+0x1010>
  20101c:       0f 1f 40 00             nopl   0x0(%rax)
  201020:       f3 0f 1e fa             endbr64
  201024:       68 00 00 00 00          pushq  $0x0
  201029:       e9 e2 ff ff ff          jmpq   201010 <func1+0x10>
  20102e:       66 90                   xchg   %ax,%ax
  201030:       f3 0f 1e fa             endbr64
  201034:       68 01 00 00 00          pushq  $0x1
  201039:       e9 d2 ff ff ff          jmpq   201010 <func1+0x10>
  20103e:       66 90                   xchg   %ax,%ax

Disassembly of section .plt.sec:

0000000000201040 <func2@plt>:
  201040:       f3 0f 1e fa             endbr64
  201044:       f2 ff 25 cd 1f 00 00    bnd jmpq *0x1fcd(%rip)        # 203018 <func2>
  20104b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

0000000000201050 <func3@plt>:
  201050:       f3 0f 1e fa             endbr64
  201054:       f2 ff 25 c5 1f 00 00    bnd jmpq *0x1fc5(%rip)        # 203020 <func3>
  20105b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

But I think this is not a good reason to change the ABI; it is better to change objdump's code to fit the new PLT.
GCC has implemented this ABI for a year; it would be very hard to push through a change.

We never considered optimizing the PLT entry size, because we cannot be sure how developers would deal with the old 16-byte PLT in their old programs.
That is also one of the reasons why GCC (H.J.) split the PLT into two 16-byte PLTs in the first place.

ruiu added a comment.Apr 10 2019, 2:43 AM

Hi MaskRay, I tested your idea by changing the patch: [...] After that, GNU objdump can symbolize .plt.sec; it really is the BND prefix that matters. [...] But I think this is not a good reason to change the ABI; it is better to change objdump's code to fit the new PLT. GCC has implemented this ABI for a year; it would be very hard to push through a change.

I don't think I'm convinced. MaskRay's point is that the tooling's support of CET is buggy and needs changing anyway; currently it looks like no tool is working correctly. Then why do we need to stick with an ABI that is claimed, without any evidence, to provide better performance? Keeping the ABI simple is perhaps much more important than you may think. I really wish you would take that more seriously.

Besides that, I don't think GCC itself implemented any feature of this PLT -- it is only in the BFD linker, not in gold. Also, no one is using the feature in the wild because no processors supporting CET are available on the market. Everything seems to be experimental at the moment.

We never considered optimizing the PLT entry size, because we cannot be sure how developers would deal with the old 16-byte PLT in their old programs.
That is also one of the reasons why GCC (H.J.) split the PLT into two 16-byte PLTs in the first place.

I don't think I'm convinced. MaskRay's point is that the tooling's support of CET is buggy and needs changing anyway; currently it looks like no tool is working correctly. Then why do we need to stick with an ABI that is claimed, without any evidence, to provide better performance? Keeping the ABI simple is perhaps much more important than you may think. I really wish you would take that more seriously.

Tooling output changes are straightforward to fix, and yes, CET is new and tooling is still being adjusted, but this adjustment is carried out by following the published ABI. We recently fixed ltrace for CET PLT usage in Fedora, for example.

To change the ABI you must go upstream to the ABI discussion list and have that discussion with the hardware vendor. The hardware vendor supports many different operating systems and implementations and the changes to the standard need to be discussed there. Alternatively you can make and adhere to your own standards and work with downstream tooling to support those standards as alternative implementations (hopefully with interoperability). I would not like to see two alternative ABIs, that leads to serious problems in downstream tooling and duplicated effort.

My opinion is that lld is too late to make the requested change to the PLT format at this point. You either get to adhere to the standard and participate in distribution builds, or not and you don't.

Besides that, I don't think GCC itself implemented any feature of this PLT -- it is only in the BFD linker, not in gold. Also, no one is using the feature in the wild because no processors supporting CET are available on the market. Everything seems to be experimental at the moment.

There is nothing wrong with a feature being only in the BFD linker; it is the canonical linker for many distributions and build systems. Gold is a distinct linker from BFD, just like lld is. Gold has limited support for working with CET-enabled libraries, and the point of that support is just to disable CET (that's what is minimally required, really). Lack of feature parity with BFD may mean downstream distributions cannot use the linker for building distribution binaries that must meet the security policy for the target machine, e.g. must have CET enabled.

Red Hat Enterprise Linux 8 Beta (publicly available for x86_64) is using the current CET PLT layout generated by BFD.

Hi Ruiu & Fangrui,
Your idea also makes sense, but it is really far too late to change the ABI now.
I hope we can reach a consensus as soon as possible.
Thank you all very much!

ruiu added a comment.Apr 16 2019, 3:13 AM

The root cause of the problem here is that the ABI doesn't give enough rationale for choosing a complex format over a simpler one, even though it was set just a few years ago and there's no historical burden forcing it. I was repeatedly told that it is too late to change the ABI, but that wasn't convincing (which is why the same topic came up many times in this thread), because no one said how much too late it is -- what work is already done, and if we redid it, how long that would likely take, etc. So, Carlos, thank you for giving some examples of work that has already been done for real programs.

I hope that future changes to the ABI come with more rationale whenever some obviously simpler format doesn't work for some reason.

So it seems I have two choices: just land this patch or start a discussion on the ABI mailing list to change the ABI. The latter would take much time.

Honestly, although I'm generally interested in the binary-level file format of executables, I'm less interested in CET itself, especially compared to those who work for Intel. So I'm naturally less interested in making the effort to propose a new ABI for CET. I guess I need cooperation from Intel people to make the desired change, but it looks like no one from Intel is interested. If that's the case, I should probably just land this patch and accept the consequences. But before that, let me ask again -- are you OK with the new 2-PLT scheme? Do you just accept it? Sorry for sticking to this point.

MaskRay added a comment.EditedApr 16 2019, 7:20 AM

Red Hat Enterprise Linux 8 Beta (publicly available for x86_64) is using the current CET PLT layout generated by BFD.

If it uses MPX and the bnd prefix, it is probably not too late to change.

To change the ABI you must go upstream to the ABI discussion list and have that discussion with the hardware vendor. The hardware vendor supports many different operating systems and implementations and the changes to the standard need to be discussed there. Alternatively you can make and adhere to your own standards and work with downstream tooling to support those standards as alternative implementations (hopefully with interoperability). I would not like to see two alternative ABIs, that leads to serious problems in downstream tooling and duplicated effort.

My opinion is that lld is too late to make the requested change to the PLT format at this point. You either get to adhere to the standard and participate in distribution builds, or not and you don't.

@codonell @xiangzhangllvm I did create a thread in the x86-64-abi mailing list https://groups.google.com/forum/#!topic/x86-64-abi/Z6GqQDy_NaI The only reply was from the maintainer. I think I'd like more opinions from others.

I am a happy Linux glibc user, but I want to learn things from multiple perspectives -- insights from other parties are good. So I asked some FreeBSD folks to chime in, and I asked how musl people felt about the second-PLT scheme. Since this second PLT is really Linux+x86+glibc lazy-binding specific, it implies more overhead for other libcs. musl people can probably live with -fno-plt (musl doesn't support glibc-style lazy binding -- they don't need a PLT at all).

To be honest, I feel the ABI is proposed in a way that forces other OSes, linkers, binary manipulation tools, emulators, disassemblers, binary analyzers, instrumentation tools and profilers to follow suit: there are announcements but no design discussions. However, I know at this point arguing is probably unavailing. Therefore, I believe I can be convinced if someone (from Intel, Red Hat, etc.) gives performance numbers demonstrating that this second-PLT scheme indeed has the claimed cache-locality merit and is better than the alternative 24-byte PLT entry scheme.

We are OK with the 2-PLT scheme.

I think I'd like **more opinions from others**.  **there are announcements but no design discussions**

Hi Fangrui, the 2-PLT scheme passed the ABI discussions; H.J. did not make it all by himself.

**gives performance numbers demonstrating this second PLT scheme has indeed the cache-locality merit and is better than the alternative 24-byte PLT entry scheme**.

Yes, we really have no performance evidence, but this design really performs better in theory, and this scheme has been in place for years and nothing has gone wrong.
The 24-byte PLT is beautiful, but that is not an overwhelming reason to overthrow the large body of previous work.

ruiu added a comment.Apr 18 2019, 1:28 AM

We are OK with the 2-PLT scheme.

I think I'd like **more opinions from others**.  **there are announcements but no design discussions**

Hi Fangrui, the 2-PLT scheme passed the ABI discussions; H.J. did not make it all by himself.

**gives performance numbers demonstrating this second PLT scheme has indeed the cache-locality merit and is better than the alternative 24-byte PLT entry scheme**.

Yes, we really have no performance evidence, but this design really performs better in theory, and this scheme has been in place for years and nothing has gone wrong.
The 24-byte PLT is beautiful, but that is not an overwhelming reason to overthrow the large body of previous work.

Let's discuss that on the x86-64 System V ABI mailing list. I signed up and once approved, I'll send a mail there.

xiangzhangllvm added a comment.EditedApr 18 2019, 8:20 PM

Hi friends, I wrote a simple test to measure the performance impact.
To keep it simple, I changed the PLT entry size from 16 bytes to 24 bytes in LLVM 8.0.0:

[xiangzh1@scels74 /export/iusers/xiangzh1/LLVM/LLVMORG]$diff llvm800/tools/lld/ELF/Arch/X86_64.cpp llvm800-PLT24/tools/lld/ELF/Arch/X86_64.cpp
67c67
<   PltEntrySize = 16;
---
>   PltEntrySize = 24; //16-->24
157a158,159
>       0x66, 0x90,                   //nop
>       0x66, 0x0f, 0x1f, 0x44, 0, 0, // nop
[xiangzh1@scels74 /export/iusers/xiangzh1/LLVM/LLVMORG]$

Then I wrote a simple test case like the following:

Main.c
#include<stdio.h>
#include<time.h>
#define FUNCTION_PLT( ID ) fplt##ID();
int main(){
int i;
clock_t t1, t2;
t1 = clock();
for(i = 0; i < 1000000; i++){
FUNCTION_PLT( 0 )
FUNCTION_PLT( 1 )
FUNCTION_PLT( 2 )
...
FUNCTION_PLT( n )
}
t2 = clock();
printf("Time cost %ld\n", t2-t1 );
  return 0;
}

Fplt.c
#define FUNCTION_PLT( ID ) int fplt##ID(){return ID;}
FUNCTION_PLT( 0 )
FUNCTION_PLT( 1 )
FUNCTION_PLT( 2 )
.....
FUNCTION_PLT( n )

After that, I built them with the 16-byte-PLT lld and the 24-byte-PLT lld:

/export/iusers/xiangzh1/LLVM/LLVMORG/build800/bin/clang -c Main.c Fplt.c -w;
/export/iusers/xiangzh1/LLVM/LLVMORG/build800/bin/ld.lld -shared Fplt.o -o Fplt.so;
/export/iusers/xiangzh1/LLVM/LLVMORG/build800-plt24/bin/clang -fuse-ld=lld Main.o Fplt.so -o a24.out;
/export/iusers/xiangzh1/LLVM/LLVMORG/build800/bin/clang -fuse-ld=lld Main.o Fplt.so -o a16.out;

Then I copied them to my personal PC, which only I use.
The following text is copied from my personal PC (I have not changed any of the data):

xiangzh1@xiangzh1:~/test/PLT/test$ export LD_LIBRARY_PATH=./
xiangzh1@xiangzh1:~/test/PLT/test$ ./a24.out
Time cost 2924013
xiangzh1@xiangzh1:~/test/PLT/test$ ./a16.out
Time cost 2721932
xiangzh1@xiangzh1:~/test/PLT/test$ ./a24.out
Time cost 2855578
xiangzh1@xiangzh1:~/test/PLT/test$ ./a16.out
Time cost 2330382
xiangzh1@xiangzh1:~/test/PLT/test$ ./a24.out
Time cost 2958377
xiangzh1@xiangzh1:~/test/PLT/test$ ./a16.out
Time cost 2361670
xiangzh1@xiangzh1:~/test/PLT/test$ ./a24.out
Time cost 2509809
xiangzh1@xiangzh1:~/test/PLT/test$ ./a16.out
Time cost 2451191
xiangzh1@xiangzh1:~/test/PLT/test$ ./a24.out
Time cost 2779484
xiangzh1@xiangzh1:~/test/PLT/test$ ./a16.out
Time cost 2551673
xiangzh1@xiangzh1:~/test/PLT/test$ ./a24.out
Time cost 2752218
xiangzh1@xiangzh1:~/test/PLT/test$ ./a16.out
Time cost 2752107
xiangzh1@xiangzh1:~/test/PLT/test$ ./a24.out
Time cost 2584999
xiangzh1@xiangzh1:~/test/PLT/test$ ./a16.out
Time cost 2630829
xiangzh1@xiangzh1:~/test/PLT/test$ ./a24.out
Time cost 3148183
xiangzh1@xiangzh1:~/test/PLT/test$ ./a16.out
Time cost 2342366
xiangzh1@xiangzh1:~/test/PLT/test$ ./a24.out
Time cost 2678235
xiangzh1@xiangzh1:~/test/PLT/test$ ./a16.out
Time cost 2518696
xiangzh1@xiangzh1:~/test/PLT/test$ ./a24.out
Time cost 2767221
xiangzh1@xiangzh1:~/test/PLT/test$ ./a16.out
Time cost 2712448
xiangzh1@xiangzh1:~/test/PLT/test$ ./a24.out
Time cost 2667558
xiangzh1@xiangzh1:~/test/PLT/test$ ./a16.out
Time cost 2818396
xiangzh1@xiangzh1:~/test/PLT/test$ ./a24.out
Time cost 2740905
xiangzh1@xiangzh1:~/test/PLT/test$ ./a16.out
Time cost 2420335
xiangzh1@xiangzh1:~/test/PLT/test$ ./a24.out
Time cost 2586176
xiangzh1@xiangzh1:~/test/PLT/test$ ./a16.out
Time cost 2556211
xiangzh1@xiangzh1:~/test/PLT/test$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
Stepping:            9
CPU MHz:             845.906
CPU max MHz:         4200.0000
CPU min MHz:         800.0000
BogoMIPS:            7200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
xiangzh1@xiangzh1:~/test/PLT/test$ ls
a16.out  a24.d  a24.dump  a24.out  cmd.txt  Fplt-bk.c  Fplt-bk.o  Fplt.c  Fplt.o  Fplt.so  Main-bk.c  Main-bk.o  Main.c  Main.o  perf.data  perf.data.a24  perf.data.old  pp  tt
xiangzh1@xiangzh1:~/test/PLT/test$ head -n 15 Main.c
#include<stdio.h>
#include<time.h>

#define FUNCTION_PLT( ID ) fplt##ID();

int main(){
int i;
clock_t t1, t2;
t1 = clock();
for(i = 0; i < 1000000; i++){

FUNCTION_PLT( 0 )
FUNCTION_PLT( 1 )
FUNCTION_PLT( 2 )
FUNCTION_PLT( 3 )
xiangzh1@xiangzh1:~/test/PLT/test$ tail -n 10 Main.c
FUNCTION_PLT( 724 )
FUNCTION_PLT( 725 )
FUNCTION_PLT( 726 )
FUNCTION_PLT( 727 )
FUNCTION_PLT( 728 )
}
t2 = clock();
printf("Time cost %ld\n", t2-t1 );
  return 0;
}
xiangzh1@xiangzh1:~/test/PLT/test$ head -n 10 Fplt.c
#define FUNCTION_PLT( ID ) int fplt##ID(){return ID;}
FUNCTION_PLT( 0 )
FUNCTION_PLT( 1 )
FUNCTION_PLT( 2 )
FUNCTION_PLT( 3 )
FUNCTION_PLT( 4 )
FUNCTION_PLT( 5 )
FUNCTION_PLT( 6 )
FUNCTION_PLT( 7 )
FUNCTION_PLT( 8 )
xiangzh1@xiangzh1:~/test/PLT/test$ tail -n 10 Fplt.c
FUNCTION_PLT( 719 )
FUNCTION_PLT( 720 )
FUNCTION_PLT( 721 )
FUNCTION_PLT( 722 )
FUNCTION_PLT( 723 )
FUNCTION_PLT( 724 )
FUNCTION_PLT( 725 )
FUNCTION_PLT( 726 )
FUNCTION_PLT( 727 )
FUNCTION_PLT( 728 )
xiangzh1@xiangzh1:~/test/PLT/test$

I found that most of the time the big-PLT binary costs more time than the small-PLT one.
This change is based on the LLVM 8.0.0 lld, but the principle is the same.
I also captured the cache misses in the main function (I increased the loop count to 10000000):

ruiu added a comment.Apr 18 2019, 9:13 PM

Xiang,

I believe that your benchmark is inappropriate and does not correctly capture the performance characteristics of PLT entries of different sizes.

First of all, your microbenchmark is too artificial. You created lots of PLT entries for functions that do nothing but immediately return, and your program does nothing but hammer PLT entries all the time. That usage pattern is very different from real programs.

Second, and more importantly, I couldn't reproduce your result with the 32-byte PLT entry. One possibility is that the slowdown you observed was due to the fact that half of the 24-byte PLT entries are not aligned to 16-byte boundaries. Intel's x86-64 Optimization Manual recommends aligning jump targets to 16 bytes, so that might not be a surprise.

My machine is an AMD Ryzen Threadripper 2990WX, so your benchmark's behavior on my machine might differ from yours, but I could still observe that the benchmark slows down only with the 24-byte PLT entry, as you can see below.

$ for i in `seq 1 10`; do echo -n "PLT16 "; ./main.16; echo -n "PLT24 "; ./main.24; echo -n "PLT32 "; ./main.32; echo; done
PLT16 Time cost 2667549
PLT24 Time cost 2891321
PLT32 Time cost 2656223

PLT16 Time cost 2682222
PLT24 Time cost 3245131
PLT32 Time cost 2682468

PLT16 Time cost 2659353
PLT24 Time cost 2895083
PLT32 Time cost 2657525

PLT16 Time cost 2680843
PLT24 Time cost 2867365
PLT32 Time cost 2676404

PLT16 Time cost 2661513
PLT24 Time cost 2931767
PLT32 Time cost 2653236

PLT16 Time cost 2680713
PLT24 Time cost 2872741
PLT32 Time cost 2678326

PLT16 Time cost 2661513
PLT24 Time cost 2892384
PLT32 Time cost 2656584

PLT16 Time cost 2683864
PLT24 Time cost 2887594
PLT32 Time cost 2658880

PLT16 Time cost 2682057
PLT24 Time cost 2911524
PLT32 Time cost 2679250

PLT16 Time cost 2654767
PLT24 Time cost 2915021
PLT32 Time cost 2659874

main.16 is linked with the regular lld. main.24 is linked with a modified lld whose output PLT size is 24 bytes. main.32 is linked with a modified lld whose output PLT size is 32 bytes.

As you can see, even with your synthetic microbenchmark, I couldn't see a difference between the 16-byte and 32-byte PLT entries. With that result, I have to say that I'm even more skeptical of the 2-PLT scheme than before.

Ping. I'm planning to upstream support for the AArch64 BTI and PAC features in LLD, llvm-objdump, and llvm-readelf. I've written the LLD patches on top of this patch, as the .note.gnu.property section is used in a similar way (GNU_PROPERTY_AARCH64_FEATURE_1_AND), so only a few small alterations are needed to that part of the code. I'm wondering if it is best to:

  • Submit the patches rebased on top of this patch.
    • The advantage here is that you could see the differences in supporting X86 and AArch64 more clearly.
  • Strip out the .note.gnu.property code from this patch and resubmit it as part of the AArch64 BTI and PAC.
    • The advantage here is that the AArch64 feature doesn't have to wait for CET to land.

Any suggestions or preferences? I have the code written and will start upstreaming tomorrow. There are some LLVM patches for llvm-readobj and llvm-objdump that the LLD patches will depend on, so I'll start with those.

The Arm ELF ABI document has details on the .note.gnu.property support: https://developer.arm.com/docs/ihi0056/latest/elf-for-the-arm-64-bit-architecture-aarch64-abi-2019q1-documentation.

I was also wondering if there has been any progress with this. Was there any more discussion on the ABI mailing list?

I'm sorry; I raised the problem with some leaders and experts, but the PLT ABI is still undecided. The biggest concern comes from the unknown risk of diverging from GNU.

@xiangzhangllvm If it is still undecided, then there is still opportunity to fix :) IMHO tooling is still immature and currently very few tools understand .plt.sec.

I'd really like to see more discussions about this, e.g. https://groups.google.com/forum/#!topic/x86-64-abi/Z6GqQDy_NaI
If PLT entries should be aligned by 16 to maximize performance, shall we switch to 32-byte PLT entry? Xiang, can you raise a topic on https://sourceware.org/ml/gnu-gabi/2019-q2 or https://sourceware.org/ml/binutils/2019-05/

I'd really like to see more discussions about this, e.g. https://groups.google.com/forum/#!topic/x86-64-abi/Z6GqQDy_NaI
If PLT entries should be aligned by 16 to maximize performance, shall we switch to 32-byte PLT entry? Xiang, can you raise a topic on https://sourceware.org/ml/gnu-gabi/2019-q2 or https://sourceware.org/ml/binutils/2019-05/

Hi Fangrui, a 32-byte PLT entry is also 16-byte aligned. I have tested the performance before; they are the same.

MaskRay added a comment.EditedMay 28 2019, 8:33 PM

Hi Fangrui, a 32-byte PLT entry is also 16-byte aligned. I have tested the performance before; they are the same.

Great. If the unified 32-byte PLT entry scheme doesn't regress, let's delete .plt.sec and fix the ABI. A simplified scheme will ease the adoption of CET. I guess you are in a good position to do this. You may send an email to the gnu-gabi and binutils mailing lists.

The current problem is that the 2nd PLT ABI has been on the GCC side for more than 2 years. I think we need to work with the GCC community to have a consistent ABI so we won't break each other. Let's have more discussion on the x86-64 ABI site.

@annita.zhang Thanks for weighing in! AFAIK, .plt.sec is a section synthesized by the linker. GCC doesn't know or have to know how PLT entries are represented.

for more than 2 years

This is less of a concern to me. Even GNU objdump compiled from binutils-gdb trunk (the same repository as ld.bfd) doesn't know how to symbolize such .plt.sec or .plt entries, so I believe the tooling is still immature and the adoption is low.

Annita, you and Xiang may have better ideas about which parties should be involved in the discussion. Could one of you start a thread on x86-64-abi, binutils, and gnu-gabi, and/or gcc (if you think GCC folks should also be involved)? You may CC Rui, Peter, and me.

ruiu added a comment.May 29 2019, 5:45 AM

I was planning to start a discussion on the gabi mailing list. I noticed that I hadn't joined the mailing list, and it's moderated, so I waited for my join request to be approved; then, when it was approved, I didn't start a discussion immediately. Sorry for my negligence. Let me start a discussion now.

In lld we already support two different PLT formats for mitigating Spectre. It might make sense to add them to the standard if enough people are already using them. The last time I tried it with gdb, gdb crashed because of the nonstandard PLT format.

MaskRay added inline comments.May 30 2019, 3:24 AM
lld/ELF/Config.h
85 ↗(On Diff #193233)

Split off the X86Features (or ANDFeatures) change and commit, to make D62609 easier to review?

ruiu added a comment.Jun 3 2019, 6:37 AM

I sent a proposal to the x86-64-abi mailing list. You can read the thread here: https://groups.google.com/forum/?hl=en#!topic/x86-64-abi/sQcX3__r4c0

ruiu updated this revision to Diff 203536.Jun 7 2019, 4:58 AM
  • rebase so that the patch does not diverge too much from the trunk
ruiu updated this revision to Diff 203771.Jun 9 2019, 10:20 PM
  • fix tests