This is an archive of the discontinued LLVM Phabricator instance.

[ELF] Pad x86 executable sections with 0xcc int3 instructions
ClosedPublic

Authored by jhenderson on Mar 13 2017, 5:20 AM.

Details

Summary

See PR32227. Executable sections should not be padded with zero by default. On some architectures, 0x00 is the start of a valid instruction sequence, so zero padding can confuse disassembly between InputSections (and indeed at the start of the next InputSection in some situations). Further, in the case of a misjump into the padding, the padding may silently be executed.

On x86, the 0xcc byte encodes the int3 trap instruction. It is a single byte long, so it serves well as padding. This change switches x86 (and x86_64) to use this value for padding in executable sections, if no linker script directive overrides it. It also puts the infrastructure in place to make it easy to change the behaviour of other targets when desired. However, I do not know the relevant trap instruction sequences for other targets, so somebody should add those separately.
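For reference, a minimal sketch of the shape of the target hook this implies. Only the TrapInstr name (mentioned later in this review) and the x86 value come from the patch context; the rest is assumed for illustration:

// Sketch: each target advertises a 32-bit fill pattern for executable
// sections. Zero means no trap filler has been defined for the target yet.
class TargetInfo {
public:
  uint32_t TrapInstr = 0; // the x86/x86_64 constructors would set 0xcccccccc
};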

Because the old behaviour simply wrote padding across the whole section before overwriting most of it, this change also modifies the padding algorithm to write padding only where needed. This in turn causes a small behaviour change in what values are written via Fill commands in linker scripts, bringing LLD into line with ld.bfd: the fill value is now written starting from the end of the previous block, so it always starts from the first byte of the fill, whereas previously the padding sometimes started mid-way through the fill value. See the test changes for more details.
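As an illustration of the fill-start change (hypothetical offsets): with a 4-byte fill value 0x11223344 and a six-byte gap starting at offset 0x6 of the output section, the padding bytes would be:

old: 33 44 11 22 33 44   (pattern written from the section start, then mostly overwritten)
new: 11 22 33 44 11 22   (pattern restarts at the first padding byte, as in ld.bfd)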

Diff Detail

Repository
rL LLVM

Event Timeline

jhenderson created this revision.Mar 13 2017, 5:20 AM
joerg added a subscriber: joerg.Mar 13 2017, 6:06 AM

Does this have any impact on compressibility of executables? E.g. gzip or xz.

hfinkel added inline comments.
ELF/Target.h
102 ↗(On Diff #91539)

Making this a uint32_t implies (at least to me) that it might make sense for the value to be something other than a single repeated byte. Does it make sense for it to be something other than that?

hfinkel added inline comments.Mar 13 2017, 6:29 AM
ELF/Target.h
102 ↗(On Diff #91539)

Never mind; the value needs to be some multiple of the length of the trap instruction, or similar, which likely uses the smallest available encoding size (whatever that happens to be for the target architecture). So long as we never need to fill regions smaller than that, there should be no ambiguity.

Does this have any impact on compressibility of executables? E.g. gzip or xz.

Without knowing much (or really anything) about the compression algorithms used, I cannot say for certain. A number of factors would likely be involved, including the number of input sections and amount of padding required between each, which will depend on alignment requirements. I would guess that it may have a small negative impact in general, but given that all padding between executable sections will be the same sequence of bytes, at least for x86, I don't expect it to be significant.

Do you have or know of any straightforward way to test this?

ELF/Target.h
102 ↗(On Diff #91539)

Also of note is that LLD and ld.gold use 32-bit integers as the input for the FILL directive, which is the way to override this value via a linker script.
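For illustration, the two standard GNU ld spellings that feed such a 32-bit value in (section names arbitrary):

SECTIONS {
  /* trailing fill expression */
  .text : { *(.text*) } =0xcccccccc
  /* FILL() command inside the section body */
  .rodata : { FILL(0x12345678); *(.rodata*) }
}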

ruiu edited edge metadata.Mar 13 2017, 9:25 AM

I actually tried the exact same thing last week and observed a significant (more than 10%) performance regression, probably due to reduced memory locality. Can you measure the performance on a multiprocessor machine?

ELF/OutputSections.cpp
244 ↗(On Diff #91539)

I prefer TrapInstr over DefaultExecutableFiller, as it is more concrete than "the default executable filler".

ruiu added a comment.Mar 13 2017, 9:32 AM

Does this have any impact on compressibility of executables? E.g. gzip or xz.

We should just measure it, but I don't expect that this would severely hurt the compressibility of executables. This patch fills gaps that used to be filled with zero bytes with another repetitive bit pattern, so it shouldn't increase entropy that much.

grimar added a subscriber: grimar.Mar 14 2017, 1:20 AM
grimar added inline comments.
ELF/OutputSections.cpp
241 ↗(On Diff #91539)

I do not think that works perfectly.
It will do the wrong thing if a linker script sets the filler to 0x00 explicitly:

.text: { *(.text*) }=0x00

In that case it will still fill gaps with 0xcc, which is not correct.
I had to use llvm::Optional in D30901 for the Filler member to fix that.

I am not sure how realistic this case is, though, so please check the other reviewers' opinions about this.
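To spell out the ambiguity with a simplified sketch (names as used elsewhere in this review; the real fix in D30901 uses llvm::Optional as described above):

// With a plain uint32_t return value, "no filler in the script" and an
// explicit "=0x00" both come back as 0, so the trap default wrongly wins:
uint32_t Filler = Script->getFiller(Name); // 0 in both cases
if (Filler == 0 && (Flags & SHF_EXECINSTR))
  Filler = 0xcccccccc; // clobbers an explicit ".text : { ... } =0x00"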

test/ELF/default-fill.s
5 ↗(On Diff #91539)

You also implemented the i686 target, so a test for the i686-pc-linux triple is probably needed.

10 ↗(On Diff #91539)

I would make the check more explicit:

# RUN: llvm-objdump -s %t.out | FileCheck %s
# CHECK:      11cccccc cccccccc cccccccc cccccccc
# CHECK-NEXT: 22

Because what you want to check is just that the whole area between input sections was filled with the pattern.

20 ↗(On Diff #91539)

You do not need .globl _start/.globl other. All you probably need here is two sections with some markers at the start, like:

.section        .text.1,"ax"
.align  16
.byte   0x11

.section        .text.2,"ax"
.align  16
.byte   0x22
jhenderson updated this revision to Diff 91731.Mar 14 2017, 8:41 AM

Thanks @grimar and @ruiu. I have updated the diff to incorporate your suggestions and fixes. I've also incorporated some of @grimar's test ideas from D30901, since they were useful. I hope you don't mind me shamelessly copying!

I haven't yet tried to compare performance. @ruiu, what do you use to measure LLD's performance?

jhenderson marked 5 inline comments as done.Mar 14 2017, 8:42 AM
jhenderson added inline comments.
ELF/OutputSections.cpp
241 ↗(On Diff #91539)

Thanks, good spot.

ruiu added a comment.Mar 14 2017, 10:13 AM

You can just build lld or clang with debug info and then link it with LLD with and without this patch.

jhenderson marked an inline comment as done.Mar 16 2017, 7:39 AM

I have just finished comparing the performance of LLD with and without this patch. I used the "time" command to time LLD when linking debug builds of clang and LLD. I ran each link six times, discarding the first run, and compared the results. This was on an Ubuntu 14.04 LTS VM with 32GB of RAM and access to 6 processor cores.

Average LLD link time without patch:
real - 3.349s
user - 5.136s
sys - 1.329s

Average LLD link time with patch:
real - 3.302s
user - 5.052s
sys - 1.274s

Average Clang link time without patch:
real - 6.046s
user - 9.386s
sys - 2.323s

Average Clang link time with patch:
real - 6.154s
user - 9.426s
sys - 2.279s

From these results, I observe no change in performance. I'm afraid I cannot see where the regression that you saw is coming from.

jhenderson updated this revision to Diff 92004.Mar 16 2017, 8:49 AM

Updated diff to merge with head cleanly.

ruiu added a comment.Mar 16 2017, 3:27 PM

I applied this patch and still see the difference. My machine is a 2-socket Xeon E5-2680 v2 @ 2.80GHz. I used perf stat -r 5 <linker command line> to run the linker five times for each test condition.

With this patch:

   32169.548068 task-clock (msec)         #    3.716 CPUs utilized            ( +-  2.78% )
        179,956 context-switches          #    0.006 M/sec                    ( +-  0.24% )
         11,756 cpu-migrations            #    0.365 K/sec                    ( +-  6.92% )
      2,569,253 page-faults               #    0.080 M/sec                    ( +-  0.99% )
 82,125,867,840 cycles                    #    2.553 GHz                      ( +-  2.96% )
 61,896,938,065 stalled-cycles-frontend   #   75.37% frontend cycles idle     ( +-  3.66% )
<not supported> stalled-cycles-backend  
 52,720,252,362 instructions              #    0.64  insns per cycle        
                                          #    1.17  stalled cycles per insn  ( +-  0.34% )
 10,152,771,263 branches                  #  315.602 M/sec                    ( +-  0.45% )
    162,810,720 branch-misses             #    1.60% of all branches          ( +-  0.19% )

    8.658035887 seconds time elapsed                                          ( +-  1.06% )

Without this patch:

   32446.779894 task-clock (msec)         #    3.929 CPUs utilized            ( +-  1.66% )
        181,730 context-switches          #    0.006 M/sec                    ( +-  0.44% )
         11,550 cpu-migrations            #    0.356 K/sec                    ( +-  7.07% )
      2,550,740 page-faults               #    0.079 M/sec                    ( +-  0.79% )
 82,878,006,345 cycles                    #    2.554 GHz                      ( +-  2.15% )
 62,603,702,917 stalled-cycles-frontend   #   75.54% frontend cycles idle     ( +-  2.79% )
<not supported> stalled-cycles-backend
 52,760,009,322 instructions              #    0.64  insns per cycle
                                          #    1.19  stalled cycles per insn  ( +-  0.21% )
 10,165,389,935 branches                  #  313.294 M/sec                    ( +-  0.24% )
    163,535,769 branch-misses             #    1.61% of all branches          ( +-  0.29% )

    8.259118872 seconds time elapsed                                          ( +-  1.04% )

In this patch, one thread writes to a large memory region to initialize it with a fixed pattern, and then immediately afterwards, other threads read from and write to that memory region. I think that memory access pattern can be quite expensive on a multi-processor machine. But I don't know why that doesn't happen on your machine.
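For illustration, the flow being described looks roughly like this (simplified from the rough diff quoted later in this thread):

// One thread initialises the entire section with the filler...
fill(Buf, this->Size, Filler); // serial pass over every output byte
// ...then parallel workers immediately overwrite most of that region,
// bouncing the freshly written cache lines between cores.
parallelForEach(Sections.begin(), Sections.end(),
                [=](InputSection *IS) { IS->writeTo<ELFT>(Buf); });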

In D30886#703297, @ruiu wrote:

In this patch, one thread writes to a large memory region to initialize it with a fixed pattern, and then immediately afterwards, other threads read from and write to that memory region. I think that memory access pattern can be quite expensive on a multi-processor machine. But I don't know why that doesn't happen on your machine.

Me neither - perhaps we have slightly different build settings, e.g. you have -ffunction-sections enabled, which would increase the amount of work for the linker to do. However, I do see why performance could be negatively affected, and hence the potential for an alternative implementation. As far as I can tell from the code as-is, the fill is written to the whole output section, and then the input sections are written over the top. This does unnecessary extra work: typically the amount of padding in the .text section is far outweighed by the amount of executable instructions. Would it not make more sense to write the fill only to the places that need it, presumably after each InputSection is written? I would expect this to significantly improve performance, because much less work is being done.

Just to add: the reason we don't see the performance issue before this patch is that usually nothing is written for padding at all. I would expect to see the same performance regression when using a linker script with a non-zero fill command for the output section.

ruiu added a comment.Mar 17 2017, 9:12 AM

Agreed. Before this, we used a filler pattern only when the linker script =<fillexp> feature was used. Now we are going to fill the gaps unconditionally, so the performance of doing that is important. As you wrote, we should write only to the gaps and not to the memory regions that will immediately be overwritten. Also, for better memory locality, we should fill the gaps in the callback function of forEach.

jhenderson updated this revision to Diff 93608.Mar 31 2017, 2:52 AM
jhenderson edited the summary of this revision. (Show Details)

I have rewritten the patch to now only write padding immediately before writing InputSection and BYTE()-style commands. This has required recording where to start writing padding when assigning offsets, which accounts for the majority of the changes.

I noticed in the process a couple of differences from the ld.bfd behaviour around FILL()/=fillexp commands. The first is that prior to this change, padding was written before any data, which meant that the padding sometimes started mid-way through the FILL value. This change fixes that (see, for example, the changes to test/ELF/linkerscript/section-padding.s). It may be possible to keep the old behaviour, if that is preferred, but it would make the patch slightly more complex. The second is that only one FILL() command in an output section is used (see the ELF/linkerscript/fill.s test), whereas ld.bfd uses the most recently seen fill for each separate block of padding in a section. I have not tried to fix this in this patch and will separately file a bug for it.
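A sketch of that second difference (hypothetical script and values):

SECTIONS {
  .text : {
    FILL(0x11111111)
    *(.text.1)   /* ld.bfd pads the gap after this with 0x11... */
    FILL(0x22222222)
    *(.text.2)   /* ...and this one with 0x22...; lld applies one fill throughout */
  }
}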

@ruiu, I expect that this will fix the performance issues you were seeing, but since I couldn't see them before, I have no way of checking. Would you mind re-running the testing you did before?

The second is that only one FILL() command in an output section is used (see the ELF/linkerscript/fill.s test), whereas ld.bfd uses the most recently seen fill for each separate block of padding in a section. I have not tried to fix this in this patch and will separately file a bug for it.

I see this was already discussed in PR30243 and D24186. I wonder whether adding finer-grained control may be easier with this new method of writing fills.

I measured the size difference in the before and after links of lld and clang when compressed using gzip:

lld uncompressed size before = 873234440
lld uncompressed size after = 873242664
lld compressed size before = 256393477
lld compressed size after = 256389682
lld compression ratio before (compressed/uncompressed) = 29.3614%
lld compression ratio after = 29.3606%

clang uncompressed size before = 1608322256
clang uncompressed size after = 1608322256
clang compressed size before = 456620551
clang compressed size after = 456595069
clang compression ratio before = 28.3911%
clang compression ratio after = 28.3895%

So interestingly, the compression actually improves fractionally with this patch.

ruiu added a comment.Mar 31 2017, 1:04 PM

This seems to be updating too many functions just to fill section gaps with INT3 or equivalent. Doesn't something like this work? (This is a rough patch, so not all tests pass, but I think you can get the idea.)

diff --git a/lld/ELF/OutputSections.cpp b/lld/ELF/OutputSections.cpp
index cda8a2b3f42..a77ae6acf73 100644
--- a/lld/ELF/OutputSections.cpp
+++ b/lld/ELF/OutputSections.cpp
@@ -238,8 +238,21 @@ template <class ELFT> void OutputSection::writeTo(uint8_t *Buf) {
   if (uint32_t Filler = Script->getFiller(this->Name))
     fill(Buf, this->Size, Filler);
-  parallelForEach(Sections.begin(), Sections.end(),
-                  [=](InputSection *IS) { IS->writeTo<ELFT>(Buf); });
+  parallelFor(0, Sections.size(), [=](size_t I) {
+    InputSection *Sec = Sections[I];
+    Sec->writeTo<ELFT>(Buf);
+
+    // Fill gaps between executable sections with INT3 or equivalent.
+    if (Sec->Flags & SHF_EXECINSTR) {
+      uint8_t *Start = Buf + Sec->OutSecOff + Sec->getSize();
+      uint8_t *End;
+      if (I + 1 == Sections.size())
+        End = Buf + this->Size;
+      else
+        End = Buf + Sections[I + 1]->OutSecOff;
+      fill(Start, End - Start, 0xcccccccc);
+    }
+  });
 
   // Linker scripts may have BYTE()-family commands with which you
   // can write arbitrary bytes to the output. Process them if any.

ruiu added a comment.Mar 31 2017, 1:06 PM

Phabricator formatted my comment in a funny way, so I pasted it here: https://gist.github.com/rui314/19a059a461dd787ebd4694aebb3b7658

jhenderson updated this revision to Diff 93853.Apr 3 2017, 6:40 AM

Thanks @ruiu. I clearly overthought the previous attempt! I've updated the diff based on your suggestions. There are a few points I feel are worth mentioning:

  • I've had to add an additional check to prevent the .eh_frame_hdr section from being written if it is empty (previously its writeTo method was called, which did nothing, but now the writeTo method writes padding to fill the section). I think there's a wider issue here regarding .eh_frame_hdr, namely that writeSections unconditionally writes this section even if it is empty and has no corresponding .eh_frame section. It has previously been discarded from the main output sections list, but that is not used in this case. I think fixing this is a separate change.
  • BYTE()-family commands are now ignored for the purposes of padding, and are written over the top, meaning that the padding does not start with the correct byte for these cases. This is no worse than before, but I will fix it in a separate change.
  • There was no test for the case with leading padding caused by assigning to dot in the linker script. I have extended a test case to add coverage, since I am changing behaviour in this area.
ruiu added a comment.Apr 3 2017, 11:54 AM

Generally looking good.

ELF/LinkerScript.cpp
943 ↗(On Diff #93853)

Return None.

ELF/OutputSections.cpp
239 ↗(On Diff #93853)

This function doesn't use any member variable of OutputSection, so it should be a non-member function.

244–248 ↗(On Diff #93853)

You can return early.

if (Optional<uint32_t> Filler = Script->getFiller(this->Name))
  return *Filler;
if (Flags & SHF_EXECINSTR)
  return Target->TrapInstr;
return 0;
257–264 ↗(On Diff #93853)

Not sure if we need to care about these edge cases. I probably wouldn't do anything for them, to keep the code simple.

ELF/SyntheticSections.cpp
2183–2186 ↗(On Diff #93853)

If this is zero, it is a no-op, right? If so, enclosing it in an if is probably more natural.

if (Optional<uint32_t> Filler = Script->getFiller(this->Name)) {
  ....
}
jhenderson marked 3 inline comments as done.Apr 4 2017, 6:08 AM
jhenderson added inline comments.
ELF/LinkerScript.cpp
943 ↗(On Diff #93853)

Thanks, I didn't know about that.

ELF/OutputSections.cpp
239 ↗(On Diff #93853)

It does use two member variables (Name and Flags), though they could relatively easily be passed in, if you would prefer it not to be a member function.

By the way, is there a preferred policy on explicitly qualifying member variables with "this->"?

257–264 ↗(On Diff #93853)

How strong are your objections? The behaviour matches ld.bfd, gold, and what lld did before. In the example use case of using trap instructions to catch misjumps, we would want the entire executable section to be padded, not just the bits between InputSections.

That said, I can make a simpler change to get the same behaviour.

jhenderson updated this revision to Diff 94063.Apr 4 2017, 6:15 AM
jhenderson marked an inline comment as done.

Addressed review comments.

ruiu added inline comments.Apr 4 2017, 11:21 AM
ELF/OutputSections.cpp
239 ↗(On Diff #93853)

Sometimes you need to explicitly write this-> for a template (http://stackoverflow.com/questions/4643074/why-do-i-have-to-access-template-base-class-members-through-the-this-pointer). We write this-> if that's the case.
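A minimal standalone example of that rule (standard C++, not code from this patch):

template <class T> struct Base { int Flags; };
template <class T> struct Derived : Base<T> {
  // Bare "Flags" would not compile: members of a dependent base class are
  // not found by unqualified name lookup, so this-> is required.
  int get() { return this->Flags; }
};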

257–264 ↗(On Diff #93853)

Since the current code is small, I'm fine with it.

254 ↗(On Diff #94063)

FillValue -> Filler, as it's shorter. (Also, a variable has a value, so Value is usually redundant.)

257 ↗(On Diff #94063)

Can Sections be empty?

jhenderson updated this revision to Diff 94250.Apr 5 2017, 9:53 AM
jhenderson marked an inline comment as done.

Rename variable and remove unrelated non-functional change from diff (this will be committed separately). Also updated one of the tests to be independent of the section order.

jhenderson added inline comments.Apr 5 2017, 9:55 AM
ELF/OutputSections.cpp
239 ↗(On Diff #93853)

Ah, I think I know why I was confused now: some places that were formerly templated no longer are, so presumably they needed this-> before, but no longer do.

257 ↗(On Diff #94063)

Yes, if a linker script assigns to "." or uses a BYTE()-style command within an output section, but no inputs have sections that go in that output section. I'd write a test, but PR32537 makes it quite hard to do so.
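A hypothetical script demonstrating that case (the output section has a size, but its list of input sections is empty):

SECTIONS {
  .text : {
    . += 0x10;   /* advancing dot creates padding with no InputSections */
    BYTE(0x66)
  }
}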

ruiu accepted this revision.Apr 5 2017, 10:04 AM

LGTM

ELF/OutputSections.cpp
257 ↗(On Diff #94250)

Sections.front() -> Sections[0].

This revision is now accepted and ready to land.Apr 5 2017, 10:04 AM
This revision was automatically updated to reflect the committed changes.