Page MenuHomePhabricator

[ELF] Add -z separate-code and pad the last page of last PF_X PT_LOAD with traps only if -z separate-code is specified
ClosedPublic

Authored by MaskRay on Jul 17 2019, 11:12 PM.

Details

Summary

This patch

  1. adds -z separate-code and -z noseparate-code (default).
  2. changes the condition that the last page of last PF_X PT_LOAD is padded with trap instructions. Current condition (after D33630): if there is no SECTIONS commands. After this change: if -z separate-code is specified.

-z separate-code was introduced to ld.bfd in 2018, to place the text
segment in its own pages. There is no overlap in pages between an
executable segment and a non-executable segment:

  1. RX cannot load initial contents from R or RW(or non-SHF_ALLOC).
  2. R and RW(or non-SHF_ALLOC) cannot load initial contents from RX.

lld's current status:

  • Between R and RX: in Writer<ELFT>::fixSectionAlignments(), the start of a segment is always aligned to maxPageSize, so the initial contents loaded by R and RX do not overlap. I plan to allow overlaps in D64906 if -z noseparate-code is in effect.
  • Between RX and RW(or non-SHF_ALLOC if RW doesn't exist): we currently unconditionally pad the last page to commonPageSize (defaults to 4096 on all targets we support). This patch will make it effective only if -z separate-code is specified.

-z separate-code is a dubious feature that intends to reduce the number
of ROP gadgets (which is actually ineffective because attackers can find
plenty of gadgets in the text segment, no need to find gadgets in
non-code regions).

With the overlapping PT_LOAD technique D64906, -z noseparate-code
removes two more alignments at segment boundaries than -z separate-code.
This saves at most defaultCommonPageSize*2 bytes, which are significant
on targets with large defaultCommonPageSize (AArch64/MIPS/PPC: 65536).

Issues/feedback on alignment at segment boundaries to help understand
the implication:

  • binutils PR24490 (the situation on ld.bfd is worse because they have two R-- on both sides of R-E so more alignments.)
  • In binutils, the 2018-02-27 commit "ld: Add --enable-separate-code" made -z separate-code the default on Linux. https://github.com/richfelker/musl-cross-make/commit/d969dea983a2cc54a1e0308a0cdeb6c3307e4bfa In musl-cross-make, binutils is configured with --disable-separate-code to address size regressions caused by -z separate-code. (lld actually has the same issue, which I plan to fix in a future patch. The ld.bfd x86 status is worse because they default to max-page-size=0x200000).
  • Stef O'Rear: I'm opposed to any kind of page alignment at the text/rodata line (having a partial page of text aliased as rodata and vice versa has no demonstrable harm, and I actually care about small systems).

So, make -z noseparate-code the default.

Diff Detail

Repository
rL LLVM

Event Timeline

MaskRay created this revision.Jul 17 2019, 11:12 PM
MaskRay edited the summary of this revision. (Show Details)Jul 17 2019, 11:17 PM
grimar added a comment.EditedJul 18 2019, 12:30 AM

My first impression: seems adding the new -z separate-code/-z noseparate-code option is a good thing to do. But
it is probably arguable what should be the default. Can we just keep the existent behavior as a default?

MaskRay added a comment.EditedJul 18 2019, 12:39 AM

My first impression: seems adding the new -z separate-code/-z noseparate-code option is a good thing to do.

+1

But it is probably arguable what should be the default. Can we just keep the existent behavior as a default?

The description summaries my reasoning why -z noseparate-code should be the default. -z separate-code is a dubious feature that intends (not proven) to reduce the number of ROP gadgets (which is actually very ineffective because attackers can find plenty of gadgets in the text segment, no need to find gadgets in non-code regions)

I have another patch D64906 to demonstrate how much it can save if we use -z noseparate-code by default:

# a is a trivial executable: `int main() {}`
% stat -c %s a   # before this patch
200712
% stat -c %s a   # D64906, -z noseparate-code
7256
% stat -c %s a   # D64906, -z separate-code. This still pays the cost of two alignments
135656
// before this patch
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000564 0x000564 R   0x10000
  LOAD           0x010000 0x0000000000010000 0x0000000000010000 0x0004ec 0x0004ec R E 0x10000
  LOAD           0x020000 0x0000000000020000 0x0000000000020000 0x0001e0 0x0001e0 RW  0x10000
  LOAD           0x030000 0x0000000000030000 0x0000000000030000 0x000010 0x000050 RW  0x10000

// D64906, -z noseparate-code
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000564 0x000564 R   0x10000                   
  LOAD           0x000580 0x0000000000010580 0x0000000000010580 0x0004ec 0x0004ec R E 0x10000
  LOAD           0x000a70 0x0000000000020a70 0x0000000000020a70 0x0001e0 0x0001e0 RW  0x10000                  
  LOAD           0x000c50 0x0000000000030c50 0x0000000000030c50 0x000010 0x000050 RW  0x10000

// D64906, -z separate-code
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000564 0x000564 R   0x10000
  LOAD           0x010000 0x0000000000010000 0x0000000000010000 0x0004ec 0x0004ec R E 0x10000
  LOAD           0x020000 0x0000000000020000 0x0000000000020000 0x0001e0 0x0001e0 RW  0x10000
  LOAD           0x0201e0 0x00000000000301e0 0x00000000000301e0 0x000010 0x000050 RW  0x10000
ruiu added a comment.Jul 18 2019, 2:50 AM

My first impression: seems adding the new -z separate-code/-z noseparate-code option is a good thing to do.

+1

But it is probably arguable what should be the default. Can we just keep the existent behavior as a default?

The description summaries my reasoning why -z noseparate-code should be the default. -z separate-code is a dubious feature that intends (not proven) to reduce the number of ROP gadgets (which is actually very ineffective because attackers can find plenty of gadgets in the text segment, no need to find gadgets in non-code regions)

It's a little off-topic, but I agree with Fangrui with this point. I don't think -z separate-code is a very effective defense against a ROP/JOP. If a program is not very small, it shouldn't be too hard to find a gadget from a text segment, and I don't think an additional piece of data that is at the remaining part of the last page of a text segment can significantly increase a risk. If your program is vulnerable to ROP/JOP, and gadgets are at predictable places, it's a game end regardless of an existence of some extra data at the end of a text segment.

I have another patch D64906 to demonstrate how much it can save if we use -z noseparate-code by default:

# a is a trivial executable: `int main() {}`
% stat -c %s a   # before this patch
200712
% stat -c %s a   # D64906, -z noseparate-code
7256
% stat -c %s a   # D64906, -z separate-code. This still pays the cost of two alignments
135656
// before this patch
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000564 0x000564 R   0x10000
  LOAD           0x010000 0x0000000000010000 0x0000000000010000 0x0004ec 0x0004ec R E 0x10000
  LOAD           0x020000 0x0000000000020000 0x0000000000020000 0x0001e0 0x0001e0 RW  0x10000
  LOAD           0x030000 0x0000000000030000 0x0000000000030000 0x000010 0x000050 RW  0x10000

// D64906, -z noseparate-code
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000564 0x000564 R   0x10000                   
  LOAD           0x000580 0x0000000000010580 0x0000000000010580 0x0004ec 0x0004ec R E 0x10000
  LOAD           0x000a70 0x0000000000020a70 0x0000000000020a70 0x0001e0 0x0001e0 RW  0x10000                  
  LOAD           0x000c50 0x0000000000030c50 0x0000000000030c50 0x000010 0x000050 RW  0x10000

// D64906, -z separate-code
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000564 0x000564 R   0x10000
  LOAD           0x010000 0x0000000000010000 0x0000000000010000 0x0004ec 0x0004ec R E 0x10000
  LOAD           0x020000 0x0000000000020000 0x0000000000020000 0x0001e0 0x0001e0 RW  0x10000
  LOAD           0x0201e0 0x00000000000301e0 0x00000000000301e0 0x000010 0x000050 RW  0x10000

OK. I have no more comments.

Would it be possible to update the description. It started as why we should not pad the end of the last PF_X program segment, and looks like we have added -z[no-]separate code. Would it also be possible to define what -zseparate-code is supposed to do in LLD? A mini spec I guess. For example should -zno-separate-code affect our default image layout? I'm finding it difficult to check the code without knowing what it is supposed to do.

MaskRay updated this revision to Diff 210528.Jul 18 2019, 4:37 AM
MaskRay edited the summary of this revision. (Show Details)

Would it also be possible to define what -zseparate-code is supposed to do in LLD?

Update description to explain -z separate-code

MaskRay edited the summary of this revision. (Show Details)Jul 18 2019, 4:38 AM

Would it also be possible to define what -zseparate-code is supposed to do in LLD?

Update description to explain -z separate-code

Thanks for the update, I understand now that there could be some more work to be done on -zseparate-code.

I think that -zseparate-code is only default for x86-64 and there was a change to the default maxpagesize to compensate for the size. I'm also in favour of small image sizes particularly for the embedded use case and in certain resource constrained phones/laptops the size aggregated across a distro can be important. Would be good to hear from the implementers of D33630 to check that they are happy to had -zseparate-code.

MaskRay edited the summary of this revision. (Show Details)Jul 18 2019, 7:32 PM

I have several points on this.

First, all the differences are demonstrated on uncompressed binaries, but if the binary size was really important for you, you'd always use compression in practice. For example, on Fuchsia where binary size is critical, we never leave binaries uncompressed on disk. To demonstrate this, I've used the same example that @MaskRay used:

$ echo 'int main() {}' > t.c
$ clang --target=aarch64-linux-gnu --sysroot=path/to/sysroot t.c -o t
$ stat -c %s t
201136
$ lz4 t t.lz4
$ stat -c %s t.lz4
4265
$ zstd t -o t.zstd
$ stat -c %s t.zstd
2435

With D64906 and D64930 applied:

$ stat -c %s t
70064
$ stat -c %s t.lz4
3730
$ stat -c %s t.zstd
2401

With zstd (which is default compressor used by Fuchsia), the difference between the two schemes is 34 bytes.

Second, with -z noseparate-code, you'll end up with a partial page wasting additional memory at runtime So the difference between -z separate-code and -z noseparate-code isn't really about security vs space as is presented in the description, it's primarily a trade-off between memory and disk space. We can reclaim most disk space loss using compression as shown above, but we cannot reclaim the memory loss.

Third, there's the security aspect. I agree that the reduction of the number of ROP gadgets is limited, but that doesn't mean we shouldn't do it. I've consulted this change with our security team today and their response is that we always want -z separate-code! Why? It's a basic security hygiene, best practice, defense in depth kind of measure. It's the same reason why we fill space between functions with traps, why we use stack protectors, why we use ASLR, etc. We have known attacks against each of those mechanisms, and not all of them have proven as effective as was believed in the past, but that doesn't mean we're going to disable them. The same applies here.

With all that being said, if there's a combination of flag that would allow preserving the current behavior, we'd set those in our driver like we already do for -z rodynamic, so I'm not going to block this change. However, I'd still much rather see the -z separate-code being the default because otherwise we're going to loose the advantages I described above (memory savings, security) on other platforms like Linux.

One potential solution might be to control the default behavior with a CMake option which is similar to the solution also used in binutils by ld (--enable-separate-code/--disable-separate-code). While this isn't being used to control the defaults in lld at the moment, it's used elsewhere in LLVM for that purpose (e.g. ENABLE_LINKER_BUILD_ID or ENABLE_X86_RELAX_RELOCATIONS in Clang) and enables each toolchain vendor to choose their defaults.

Although I don't think that it applies to this current patch as I believe that only part of -z separate-code is implemented here. A colleague of mine mentioned that if we have a linker script similar to GNU ld's default, where the .text section is before .rodata then we need to be careful not to include the ELF headers in the executable segment. This won't be an issue for LLD's default image layout as the ro-data comes before the .text.

MaskRay added a comment.EditedJul 21 2019, 8:06 PM

First, all the differences are demonstrated on uncompressed binaries, but if the binary size was really important for you, you'd always use compression in practice.

This works on platforms where you control the whole stack, but the practice may not apply elsewhere, e.g. if you provide components to another Android package when they require the pre-compression component to be smaller than 1M. This is a fairly reasonable request because the package itself will be zipped, so they ask you to control the pre-compression size. You also pay the cost to uncompress the binaries before executing them. It the binary is not uncompressed on the disk, if you run multiple instances of the same executable, the image cannot be shared.

With D64906 and D64930 applied:

$ stat -c %s t

70064

Thanks for your testing! I think the result was measured before I updated a newer revision where I removed one more alignment boundary (after I realized the two RW (relro and non-relro) can overlap as well).

In my experiments, the size difference is now 200760-6672 bytes, which is close to 3*65536, the theoretical maximum amount of saving due the removal of 3 alignment boundaries.

If we can only utilize half of the maxmimum saving, 3*65536/2 = 96k for each binary will still be significant for a distribution (think: every /usr/bin/ and /usr/lib/ binary can benefit from it).

Second, with -z noseparate-code, you'll end up with a partial page wasting additional memory at runtime So the difference between -z separate-code and -z noseparate-code isn't really about security vs space as is presented in the description, it's primarily a trade-off between memory and disk space.

The -z noseparate-code case may or may not load one more page. If the map aligned to a page boundary occupies [0x10000, 0x10200), with an offset, it occupies [0x11000, 0x11200), both are mapped as [0x10000, 0x20000) at runtime. If you uncompress your binary before loading it, I bet there is no timing difference at all because the uncompressed image is already in the buffer cache. The PT_LOAD segments overlap. Moreover, the binary size with the -z noseparate-code case is smaller, therefore the amount of disk IO is less.

Third, there's the security aspect. I agree that the reduction of the number of ROP gadgets is limited, but that doesn't mean we shouldn't do it. I've consulted this change with our security team today and their response is that we always want -z separate-code! Why? It's a basic security hygiene, best practice, defense in depth kind of measure. It's the same reason why we fill space between functions with traps, why we use stack protectors, why we use ASLR, etc. We have known attacks against each of those mechanisms, and not all of them have proven as effective as was believed in the past, but that doesn't mean we're going to disable them. The same applies here.

I am concerned with security and I know a little writing ROP gadgets, but I am not sure the the protection provided by -z separate-code can be compared with other time-tested mitigations like ASLR/stack protector/read_implies_exec (PT_GNU_STACK). My CTF team (venerable on ctftime.org) has lots of experience exploiting binaries. I have consulted them and the feedback is that one does not to exploit the traditional RO part (mapped as RX) to obtain various types of gadgets. Only in one case they had to use a svc instruction in the RO part in thumb mode. However, I am still in favor of lld's current R & RX layout: data traps immediately if you try to call it as functions.

With all that being said, if there's a combination of flag that would allow preserving the current behavior, we'd set those in our driver like we already do for -z rodynamic, so I'm not going to block this change.

This is one of my favorite lld specific options :) If gdb doesn't use DT_DEBUG, I really hope we can switch to -z rodynamic for all architectures. The writable .dynamic was pretty much a mistake.

One potential solution might be to control the default behavior with a CMake option which is similar to the solution also used in binutils by ld (--enable-separate-code/--disable-separate-code). While this isn't being used to control the defaults in lld at the moment, it's used elsewhere in LLVM for that purpose (e.g. ENABLE_LINKER_BUILD_ID or ENABLE_X86_RELAX_RELOCATIONS in Clang) and enables each toolchain vendor to choose their defaults.

Making clangDriver handle the option sounds a good idea. Doing it in lld may not be a very good solution (see more discussions on D56650).

You have a great understanding on the tooling and what defaults are the best on your platform. As a linker default, I think we should probably care more about what regular users want (you have a platform where you control the whole stack and easily tune llvm/clang/libcxx/compiler-rt default whenever applicable but a regular user/distribution doesn't have to background/motivation/interests to tune the defaults.). Increasing every /usr/bin /usr/lib binary by 192kb (or 96kb) for the reduction in the potential (hypothetical) gadgets in RO regions, IMHO is not a worthwhile trade-off for many users. (That said, I do appreciate your efforts to make your platform as secure as possible.) This -z noseparate choice, as the most practical default today, can of course by flipped if it is proven to be a bad choice in the future.

Ping :)

I'm happy with the change to introduce the option and its effects. I'm not sure we have consensus on what the default should be for the option though. Today I have a weak preference for no separate code for Arm and AArch64, but I'd be happy with a conservative approach of keeping the existing --separate-code until there becomes a relatively convenient to make platform/distro specific decisions as ultimately that is where the decision should be made. The separate code default would also lower the amount of tests that would need to be updated for D64906 and D64930 making them easier to review and less of a risky change.

I think that there is a case for build-time defaults in LLD as there is not always a clang driver for the specific platform or LLD might be invoked by GCC. I think D56650 is substantially more of a change from changing defaults of existing command line options as no decision making logic has to change.

MaskRay added a comment.EditedWed, Jul 24, 4:30 AM

I'm happy with the change to introduce the option and its effects. I'm not sure we have consensus on what the default should be for the option though. Today I have a weak preference for no separate code for Arm and AArch64, but I'd be happy with a conservative approach of keeping the existing --separate-code until there becomes a relatively convenient to make platform/distro specific decisions as ultimately that is where the decision should be made. The separate code default would also lower the amount of tests that would need to be updated for D64906 and D64930 making them easier to review and less of a risky change.

As you see, this patch doesn't touch too many tests. The test fixing problem occurs when we do more layout changes (D64906) and enable the technique on more targets:

bool enable = config->emachine == EM_AARCH64 || config->emachine == EM_PPC64; // all new targets here

I think we still have to fix 40%~60% tests (depending on the target) if we use -z separate-code by default:

config->zSeparateCode = getZFlag(args, "separate-code", "noseparate-code", true); // changed to false

With the default relro layout, the commonly used .dynamic and .got are in the first RW (RELRO RW), while the commonly used .got.plt, .data and .bss are in the second RW (non-RELRO RW). The alignment between the two RWs will also affect the addresses. -z separate-code doesn't affect the transition from one RW to the other RW.

ARM/AArch64/PPC/MIPS suffer from the defaultMaxPageSize=64k issue the most. AArch64 and PPC will be fixed by D64906 and D64930. (It would not make it much easier if -z separate-code were used by default.) MIPS has 50+ tests (I may work on it as I want to learn a bit more about MIPS). The 300+ x86 tests that will be affected are a pain. Thankfully it is probably not worth fixing x86 because defaultMaxPageSize=4096 on x86.

🤡

This will make future size optimization possible...

ruiu accepted this revision.Thu, Aug 1, 2:05 AM

LGTM

This revision is now accepted and ready to land.Thu, Aug 1, 2:05 AM
MaskRay updated this revision to Diff 212767.Thu, Aug 1, 2:49 AM
MaskRay retitled this revision from [ELF] Pad the last page of last PF_X PT_LOAD with traps when -z separate-code is specified to [ELF] Add -z separate-code and pad the last page of last PF_X PT_LOAD with traps only if -z separate-code is specified.
MaskRay edited the summary of this revision. (Show Details)

Update title and description

MaskRay updated this revision to Diff 212768.Thu, Aug 1, 2:56 AM
MaskRay edited the summary of this revision. (Show Details)

Update description

This revision was automatically updated to reflect the committed changes.
labath added a subscriber: labath.Thu, Aug 1, 4:38 AM

It seems this has broken some tests in lldb :/ http://lab.llvm.org:8011/builders/lldb-x64-windows-ninja/builds/7321. By the looks of things, the new layout of sections has triggered some kind of a debug_line parsing bug. I'm looking at whether there is an easy fix for that (I am hoping that there is one).

MaskRay added a comment.EditedThu, Aug 1, 5:05 AM

It seems this has broken some tests in lldb :/ http://lab.llvm.org:8011/builders/lldb-x64-windows-ninja/builds/7321. By the looks of things, the new layout of sections has triggered some kind of a debug_line parsing bug. I'm looking at whether there is an easy fix for that (I am hoping that there is one).

My bad.. I didn't know some lldb tests run ld.lld.. Non-SHF_ALLOC sections such as .comment, .symtab, and .shstrtab

Since you're already looking at them (and it seems you added these tests:) I'll not fix at the same time then. If the bug is not easy to fix, try adding -z separate-code.

Edit: worked around lldb tests in rLLDB367549