Support -fstack-clash-protection for x86
Needs Review · Public

Authored by serge-sans-paille on Oct 9 2019, 11:30 AM.

Details

Reviewers
rnk
craig.topper
Summary

Implement protection against the stack clash attack [0] through inline stack probing.

Probe stack allocations every PAGE_SIZE during frame lowering and dynamic allocation, so that the guard page, if any, is touched whenever the stack grows, in a manner similar to GCC [1].

This extends the existing probe-stack mechanism with a special value, inline-asm. The former emits a function call before the stack allocation, while this patch emits inlined stack probes and allocates the stack in chunks.

Only implemented for x86.

[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html
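
To illustrate (a minimal, hand-written sketch, not actual compiler output), a frame larger than a page is allocated one page at a time, each chunk being probed right away:

    subq    $4096, %rsp             # allocate one page...
    movq    $0, (%rsp)              # ...and probe it, faulting on the guard page if we reached it
    subq    $4096, %rsp             # next page
    movq    $0, (%rsp)
    subq    $1000, %rsp             # remainder (< PAGE_SIZE)
    movq    $0, (%rsp)              # final probe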

Diff Detail

Event Timeline


Is there some reason this isn't using the existing stack-probe-size attribute?

Sorry, I meant the "probe-stack" attribute.

@efriedma: there is indeed an overlap with the probe-stack attribute. The probe-stack attribute (a) forces a function call, and (b) this call happens only before the stack gets expanded.

(a) is probably a performance issue in several cases, and it requires an extra register (as mentioned in https://reviews.llvm.org/D9653)
(b) is an issue, as pointed out in https://lwn.net/Articles/726587/ (grep for valgrind): from Valgrind's point of view, accessing unallocated stack memory triggers an error, and we probably want to keep Valgrind happy

Doing the call *after* the stack allocation is not an option either, as a signal could be raised between the stack allocation and the stack probing; if a custom signal handler runs at that point, it executes on unprobed stack memory.
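
Concretely, the unsafe window looks like this (a sketch; __probe_stack stands in for whatever probing routine is used):

    subq    $8192, %rsp             # stack grown past the guard page, nothing probed yet
                                    # <- a signal delivered here runs its handler on the
                                    #    unprobed (possibly unmapped) stack
    callq   __probe_stack           # too late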

That being said, I do think it would be good to have a special value for probe-stack, say probe-stack=inline-asm, that would trigger generation of inline probes as this patch does. That way we have all the pieces in one place, with different strategies, and clang would set the attribute on each function when -fstack-clash-protection is given.

Move to probe-stack compatibility, using a dedicated name to trigger inline probing. It looks better to me because

  1. it leverages existing mechanics
  2. it has finer granularity

@efriedma: also, compared to probe-stack with a function, this version can use existing MOV operations to avoid generating probes, which looks like a big plus to me.

> (b) is an issue, as pointed out in https://lwn.net/Articles/726587/ (grep for valgrind): from Valgrind's point of view, accessing unallocated stack memory triggers an error, and we probably want to keep Valgrind happy
>
> Doing the call *after* the stack allocation is not an option either, as a signal could be raised between the stack allocation and the stack probing; if a custom signal handler runs at that point, it executes on unprobed stack memory.

I'm not sure I follow. How are you solving this problem in your patch? By limiting the amount you adjust the stack at a time? What limit is sufficient to avoid this issue?


Can you give a complete assembly listing for small examples of static and dynamic stack probing?

llvm/lib/Target/X86/X86FrameLowering.cpp
482

Why is this code in a different location from the stack probing code that generates a call?

490

This algorithm needs some documentation; it isn't at all obvious what it's doing. Particularly the interaction with "free" stack probes.

Should we generate a loop if the stack frame is large?

Please add info about this new feature to the release notes.

Added documentation + release note entry

@efriedma the free probe algorithm requires more testing, and I'd like to take memset and memcpy into account as free probes too. To showcase the algorithm, consider the following LLVM IR:

define i32 @foo() local_unnamed_addr {
  %a = alloca i32, i64 2000, align 16
  %b = getelementptr inbounds i32, i32* %a, i64 1198
  store volatile i32 1, i32* %b
  %c = load volatile i32, i32* %a
  ret i32 %c
}

When compiled with llc, it outputs the following assembly:

foo:                                    # @foo
    subq    $7880, %rsp             # imm = 0x1EC8
    movl    $1, 4664(%rsp)
    movl    -128(%rsp), %eax
    addq    $7880, %rsp             # imm = 0x1EC8
    retq

When probe-stack is set to inline-asm, it outputs:

foo:                                    # @foo
	subq	$4096, %rsp             # imm = 0x1000
	movl	$1, 880(%rsp)
	subq	$3784, %rsp             # imm = 0xEC8
	movq	$0, (%rsp)
	movl	-128(%rsp), %eax
	addq	$7880, %rsp             # imm = 0x1EC8
	retq

The stack allocation is split in two, but only one MOV is added; the first one is what I call a free probe. It turns out we could rely on natural probes alone here; I still need to implement that :-)

For comparison, setting probe-stack to an arbitrary function name like __probe_stack outputs the following:

foo:                                    # @foo
	movl	$8008, %eax             # imm = 0x1F48
	callq	__probe_stack
	subq	%rax, %rsp
	movl	$1, 4792(%rsp)
	movl	(%rsp), %eax
	addq	$8008, %rsp             # imm = 0x1F48
	retq

which requires runtime support (to provide __probe_stack) and incurs function call overhead, while ideally just an extra adjustment of %rsp would be needed.

serge-sans-paille marked 3 inline comments as done. · Oct 10 2019, 2:09 AM
serge-sans-paille added inline comments.
llvm/lib/Target/X86/X86FrameLowering.cpp
482

Because BuildStackAdjustment has more call sites than just emitPrologue, and we want to capture all stack manipulations.

490

> Should we generate a loop if the stack frame is large?

That's an option. It's more complex to make looping compatible with free probes, but it's possible.

490

> This algorithm needs some documentation

Done

serge-sans-paille edited the summary of this revision.

Added test cases and statistics, and refactored interactions with the existing stack-probing mechanism.

Some early stats: on the sqlite amalgamation [0], free probe reuse makes it possible to skip 123 of the 474 probes needed during frame lowering.

[0] https://www.sqlite.org/download.html

xbolva00 added inline comments. · Oct 10 2019, 12:06 PM
llvm/lib/Target/X86/X86FrameLowering.cpp
506

nit: llvm::any_of

efriedma added inline comments. · Oct 10 2019, 3:38 PM
llvm/lib/Target/X86/X86FrameLowering.cpp
498

Each probe has to have an offset of at most PageSize bytes from the previous probe. If each probe is exactly PageSize bytes away from the previous probe, that's fine. But it looks like you don't enforce the distance between free probes correctly?

508

There are instructions that don't refer to any FI, but are still relevant. For example, calls.

tstellar added inline comments. · Oct 10 2019, 4:05 PM
clang/test/CodeGen/stack-clash-protection.c
4

There were concerns [1] raised recently about adding clang tests that depend on codegen. Is something being tested here that can't be tested with an IR test? If you only need to test that the frontend option works, checking the IR for the necessary function attributes might be enough.

[1] http://lists.llvm.org/pipermail/cfe-dev/2019-September/063309.html

Ensure the distance between two probes is at most PAGE_SIZE.
Use calls as free probes.
Fix alignment for dynamic alloca.

This passes the llvm test-suite, and thanks to the use of calls, no inserted probes are needed to compile sqlite!
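
For the record, calls act as free probes because the CALL instruction itself writes to the stack (a sketch; bar is an arbitrary callee):

    subq    $4096, %rsp             # allocate one page
    callq   bar                     # CALL pushes the return address at the new top of
                                    # the stack, touching the fresh page: a free probe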

rnk added a comment. · Oct 14 2019, 3:25 PM

For maintenance reasons, I'd really prefer it if we could find a way to reuse the existing code that calls an external stack probe function. What do you think about taking a look at X86RetpolineThunks.cpp and doing something similar to that? Basically, when the user sets -fstack-clash-protection, LLVM will emit a small comdat+weak function into every object file that has the same ABI as the existing stack probe mechanism. For other prior art, you can also look at how __clang_call_terminate works.

llvm/lib/Target/X86/X86FrameLowering.cpp
428

Please no dynamically initialized globals. The LLVM-y thing would be to use a switch.

482

If you care about those cases, we should have tests for all of them. These are all the cases:

  1. TailCallReturnAddrDelta: When doing guaranteed TCO for a callee with >4K of argument memory
  2. eliminateCallFramePseudoInstr: When doing stack adjustments to set up a call with more than 4K of arguments

Both of these cases involve setting up arguments to calls, and I think we can guarantee that the compiler will emit writes to every argument stack slot. So, to set up one of these cases, we'd have this code:

#include <string.h>  // for memset

struct EightK { int a[2048]; };
void passEightK(EightK);
void foo() {
  EightK x;
  memset(&x, 0, sizeof(x));
  passEightK(x); // some targets will pass indirect, 32-bit probably uses byval.
}

In this case, LLVM happens to use rep;movsl to copy the argument memory. It could use memcpy, but either way, it will probe all those bytes.
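
Roughly like this (a sketch with illustrative offsets, not the exact output for the snippet above):

    leaq    8192(%rsp), %rsi        # source: the local copy of x
    movq    %rsp, %rdi              # destination: the outgoing argument area
    movl    $2048, %ecx             # 2048 dwords = 8 KiB
    rep;movsl                       # writes every byte of the argument area,
                                    # probing those pages as a side effect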

I think that means it's safe to move to emitSPUpdate. I would greatly prefer that BuildStackAdjustment remain a simple primitive that generates a single instruction.

llvm/test/CodeGen/X86/stack-clash-dynamic-alloca.ll
10–12

This seems like a good use case for update_llc_test_checks.py. I'd want to see the body of the loop, the cmp, the jl, etc.

Some benchmarks, running the Python performance suite on two recompiled CPython builds, one built with -O2 (baseline) and one built with -O2 -fstack-clash-protection (protection). Surprisingly enough, the stack protection makes the code faster in various scenarios; in my view, none of these changes represents a significant regression.

baseline.json
=============

Performance version: 0.9.1
Report on Linux-3.10.0-891.el7.x86_64-x86_64-with-redhat-7.5-Maipo
Number of logical CPUs: 8
Start date: 2019-10-15 11:16:28.425344
End date: 2019-10-15 11:37:55.064599

protection.json
===============

Performance version: 0.9.1
Report on Linux-3.10.0-891.el7.x86_64-x86_64-with-redhat-7.5-Maipo
Number of logical CPUs: 8
Start date: 2019-10-15 10:55:24.270166
End date: 2019-10-15 11:16:27.195366

+-------------------------+---------------+-----------------+--------------+-----------------------+
| Benchmark               | baseline.json | protection.json | Change       | Significance          |
+=========================+===============+=================+==============+=======================+
| 2to3                    | 435 ms        | 448 ms          | 1.03x slower | Significant (t=-4.26) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| chameleon               | 15.0 ms       | 14.7 ms         | 1.02x faster | Significant (t=4.76)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| chaos                   | 176 ms        | 174 ms          | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| crypto_pyaes            | 153 ms        | 150 ms          | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| deltablue               | 11.9 ms       | 11.9 ms         | 1.00x slower | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| django_template         | 210 ms        | 223 ms          | 1.06x slower | Significant (t=-3.84) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| dulwich_log             | 96.1 ms       | 102 ms          | 1.06x slower | Significant (t=-8.42) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| fannkuch                | 703 ms        | 698 ms          | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| float                   | 161 ms        | 160 ms          | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| genshi_text             | 45.5 ms       | 45.5 ms         | 1.00x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| genshi_xml              | 95.3 ms       | 95.2 ms         | 1.00x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| go                      | 392 ms        | 382 ms          | 1.03x faster | Significant (t=5.95)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| hexiom                  | 16.0 ms       | 15.9 ms         | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| html5lib                | 137 ms        | 135 ms          | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| json_dumps              | 18.8 ms       | 17.9 ms         | 1.05x faster | Significant (t=4.81)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| json_loads              | 41.3 us       | 40.2 us         | 1.03x faster | Significant (t=2.53)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| logging_format          | 17.4 us       | 17.5 us         | 1.00x slower | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| logging_silent          | 509 ns        | 500 ns          | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| logging_simple          | 14.3 us       | 14.8 us         | 1.04x slower | Significant (t=-5.64) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| mako                    | 30.6 ms       | 30.4 ms         | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| meteor_contest          | 134 ms        | 130 ms          | 1.03x faster | Significant (t=4.07)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| nbody                   | 186 ms        | 180 ms          | 1.03x faster | Significant (t=4.87)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| nqueens                 | 152 ms        | 145 ms          | 1.05x faster | Significant (t=5.26)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| pathlib                 | 29.5 ms       | 27.7 ms         | 1.07x faster | Significant (t=4.33)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| pickle                  | 15.7 us       | 15.6 us         | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| pickle_dict             | 45.1 us       | 43.2 us         | 1.04x faster | Significant (t=34.35) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| pickle_list             | 6.05 us       | 5.67 us         | 1.07x faster | Significant (t=5.15)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| pickle_pure_python      | 766 us        | 786 us          | 1.03x slower | Significant (t=-3.19) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| pidigits                | 206 ms        | 208 ms          | 1.01x slower | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| python_startup          | 12.4 ms       | 12.7 ms         | 1.03x slower | Significant (t=-8.03) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| python_startup_no_site  | 7.62 ms       | 7.75 ms         | 1.02x slower | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| raytrace                | 830 ms        | 805 ms          | 1.03x faster | Significant (t=15.48) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| regex_compile           | 264 ms        | 261 ms          | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| regex_dna               | 195 ms        | 195 ms          | 1.00x slower | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| regex_effbot            | 3.59 ms       | 3.50 ms         | 1.02x faster | Significant (t=7.90)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| regex_v8                | 30.0 ms       | 29.3 ms         | 1.02x faster | Significant (t=7.10)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| richards                | 122 ms        | 120 ms          | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| scimark_fft             | 466 ms        | 462 ms          | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| scimark_lu              | 358 ms        | 334 ms          | 1.07x faster | Significant (t=5.46)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| scimark_monte_carlo     | 167 ms        | 155 ms          | 1.08x faster | Significant (t=4.68)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| scimark_sor             | 346 ms        | 328 ms          | 1.05x faster | Significant (t=7.55)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| scimark_sparse_mat_mult | 5.49 ms       | 5.14 ms         | 1.07x faster | Significant (t=13.23) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| spectral_norm           | 209 ms        | 199 ms          | 1.05x faster | Significant (t=4.10)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| sympy_expand            | 650 ms        | 638 ms          | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| sympy_integrate         | 30.8 ms       | 30.2 ms         | 1.02x faster | Significant (t=9.94)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| sympy_str               | 425 ms        | 407 ms          | 1.04x faster | Significant (t=4.26)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| sympy_sum               | 248 ms        | 248 ms          | 1.00x slower | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| telco                   | 9.63 ms       | 9.37 ms         | 1.03x faster | Significant (t=9.26)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| tornado_http            | 371 ms        | 369 ms          | 1.00x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| unpack_sequence         | 60.1 ns       | 59.0 ns         | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| unpickle                | 19.3 us       | 18.9 us         | 1.02x faster | Significant (t=8.65)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| unpickle_list           | 4.72 us       | 4.67 us         | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| unpickle_pure_python    | 576 us        | 562 us          | 1.02x faster | Significant (t=7.27)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| xml_etree_generate      | 156 ms        | 153 ms          | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| xml_etree_iterparse     | 157 ms        | 146 ms          | 1.08x faster | Significant (t=6.27)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| xml_etree_parse         | 187 ms        | 183 ms          | 1.02x faster | Significant (t=4.14)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| xml_etree_process       | 126 ms        | 123 ms          | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+

> For maintenance reasons, I'd really prefer it if we could find a way to reuse the existing code that calls an external stack probe function. What do you think about taking a look at X86RetpolineThunks.cpp and doing something similar to that? Basically, when the user sets -fstack-clash-protection, LLVM will emit a small comdat+weak function into every object file that has the same ABI as the existing stack probe mechanism. For other prior art, you can also look at how __clang_call_terminate works.

@rnk I 100% understand the reasoning and will have a look at the files you're pointing at. AFAIU the approach you're suggesting is incompatible with the "free probe" algorithm that moves instructions in between stack allocations, but let me confirm that first.

Another test run on an x264 encoder (source: https://openbenchmarking.org/test/pts/x264)

Compiled with -O2, with and without -fstack-clash-protection; run without threads (x265 --pools 1 -F 1 ./Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m /dev/null)

Clang

baseline:  318.60s 
protection:  317.72s

So no performance impact beyond noise.

The compilation inserts 44 inline probes in 9 functions.

gcc

For comparison, with gcc 8.2 (yeah, it's a bit old), I get (same flags & setup):

baseline: 417.53 sec
protected: 412.6 sec

The compilation inserts inline probes in 22 functions.
So gcc instruments more functions; the impact on performance is equally surprising. I need to gather more data points to draw statistically meaningful conclusions.

Get rid of static mapping + update test cases

serge-sans-paille marked 6 inline comments as done. · Oct 15 2019, 8:37 AM
dmajor added a subscriber: dmajor. · Oct 16 2019, 10:44 AM

Moved the implementation to a specialization of emitStackProbeInline, which used to be Windows-centric, and provided a generic implementation that generates inline probes instead. That way we don't clutter existing functions and only add a new case to emitSPUpdate.

Handle large stack allocation with a loop instead of a (large) basic block.
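
The loop has roughly the following shape (a sketch; register choice and constants are illustrative):

    movq    %rsp, %r11
    subq    $65536, %r11            # %r11 = target stack pointer
.LBB0_1:
    subq    $4096, %rsp             # allocate one page
    movq    $0, (%rsp)              # probe it
    cmpq    %r11, %rsp
    jne     .LBB0_1                 # loop until the target is reached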

@rnk: this is far less intrusive and better integrated with the existing structure; thanks a lot for pointing me in that direction.

Extra note: an older version of the patch has been tested by the Firefox team without much performance impact (and no test failures); see https://bugzilla.mozilla.org/show_bug.cgi?id=1588710

Explicitly prefer the existing mechanism on Windows.
Split the loop/block allocation strategies into separate functions.

You should probably also document the flag in clang/docs/ClangCommandLineReference.rst; we have -fstack-protector-strong & co. in that doc.

Better integration with the MachineInstr description.
Handle stacks with multiple stack objects.
More test cases.

Update documentation as suggested by @sylvestre.ledru.
Fix a corner case where a write touched memory beyond the allocated stack.

Some benchmark / instrumentation reporting: due to the way memory moves are ordered in the entry block, there tend to be relatively few free probes between two stack allocations within a function, and a large number after the last one.

When recompiling llc, I get only 18 free probes between stack allocations, and a large number (several thousand) after the last one.
When recompiling cpython, I get only 4 free probes between stack allocations.

This may be improved by rescheduling instructions or changing the stack object layout, but that's probably not worth the effort (there aren't that many functions with stack space larger than a page).

Temporarily comment out call support as free probes; everything else passes validation, but a call may have some stack effect I don't handle (yet).

clang/docs/ClangCommandLineReference.rst
1901 ↗(On Diff #226109)

Maybe add that it is Linux only? :)

serge-sans-paille edited the summary of this revision.

Moved to a simpler, easier-to-review, less error-prone implementation. It validates just fine on an LLVM bootstrap and the cpython code base. @sylvestre.ledru: that's the patch you can play with.

@rnk what's your take on that version?

@sylvestre.ledru did the testing and benchmarking on Firefox (see https://bugzilla.mozilla.org/show_bug.cgi?id=1588710#c12); everything seems OK, so let's move forward?

> @sylvestre.ledru did the testing and benchmarking on Firefox (see https://bugzilla.mozilla.org/show_bug.cgi?id=1588710#c12); everything seems OK, so let's move forward?

Actually, I'm not sure about the benchmark anymore. I am checking with experts. Sorry.