This is an archive of the discontinued LLVM Phabricator instance.

Support -fstack-clash-protection for x86
ClosedPublic

Authored by serge-sans-paille on Oct 9 2019, 11:30 AM.

Details

Summary

Implement protection against the stack clash attack [0] through inline stack probing.

Probe stack allocation every PAGE_SIZE during frame lowering or dynamic allocation to make sure the page guard, if any, is touched when touching the stack, in a similar manner to GCC[1].

This extends the existing probe-stack mechanism with a special value inline-asm. Technically the former uses function call before stack allocation while this patch provides inlined stack probes and chunk allocation.

Only implemented for x86.

[0] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[1] https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html

Diff Detail

Event Timeline

Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptOct 9 2019, 11:30 AM

Is there some reason this isn't using the existing stack-probe-size attribute?

Sorry, I meant the "probe-stack" attribute.

@efriedma : there's indeed an intersection with the probe-stack attribute. The probe-stack attribute (a) forces a function call, and (b) this function call only happens before the stack gets expanded.

(a) is probably a performance issue in several cases, plus it requires an extra register (that's mentioned in https://reviews.llvm.org/D9653)
(b) is an issue, as pointed out in https://lwn.net/Articles/726587/ (grep for valgrind) : from valgrind point of view, accessing un-allocated stack memory triggers error, and we probably want to please valgrind

Doing the call *after* the stack allocation is also not an option, as a signal could be raised between the stack allocation and the stack probing, escaping the stack probe if a custom signal handler is executed.

That being said, I do think it would be a good thing to have a special value for probe-stack, say probe-stack=inline-asm, that would trigger generation of inlined assembly as I do. That way we have all the pieces in one place, with different strategies. And we would have clang set the attribute for each function when -fstack-clash-protection is given.

Move to stack-probe compatibility, using a dedicated name to trigger inline assembly. It looks better to me because

  1. it leverage existing mechanics
  2. it has a finer grain

@efriedma alos compared to probe-stack with a function, this version has the ability to use existing MOV operations to avoid generating probes, which looks like a big plus to me.

(b) is an issue, as pointed out in https://lwn.net/Articles/726587/ (grep for valgrind) : from valgrind point of view, accessing un-allocated stack memory triggers error, and we probably want to please valgrind

Doing the call *after* the stack allocation is also not an option, as a signal could be raised between the stack allocation and the stack probing, escaping the stack probe if a custom signal handler is executed.

I'm not sure I follow. How are you solving this problem in your patch? By limiting the amount you adjust the stack at a time? What limit is sufficient to avoid this issue?


Can you give a complete assembly listing for small examples of static and dynamic stack probing?

llvm/lib/Target/X86/X86FrameLowering.cpp
427

Why is this code in a different location from the stack probing code that generates a call?

435

This algorithm needs some documentation; it isn't at all obvious what it's doing. Particularly the interaction with "free" stack probes.

Should we generate a loop if the stack frame is large?

Please add info about this new feature to release notes

Added documentation + release note entry

@efriedma the free probe algorithm requires more testing, and I'd like to take into account memset and memcpy as free probes too. To showcase this algorithm, consider the following LLVM bitcode:

define i32 @foo() local_unnamed_addr {
  %a = alloca i32, i64 2000, align 16
  %b = getelementptr inbounds i32, i32* %a, i64 1198
  store volatile i32 1, i32* %b
  %c = load volatile i32, i32* %a
  ret i32 %c
}

when compiled with llc it outputs the following assembly:

foo:                                    # @foo
    subq    $7880, %rsp             # imm = 0x1EC8
    movl    $1, 4664(%rsp)
    movl    -128(%rsp), %eax
    addq    $7880, %rsp             # imm = 0x1EC8
    retq

When probe-stack is set to inline-asm it outputs

foo:                                    # @foo
	subq	$4096, %rsp             # imm = 0x1000
	movl	$1, 880(%rsp)
	subq	$3784, %rsp             # imm = 0xEC8
	movq	$0, (%rsp)
	movl	-128(%rsp), %eax
	addq	$7880, %rsp             # imm = 0x1EC8
	retq

The stack allocation is split in two, but only one MOV is added, the first one is what I call a free probe. Turns out we could only use natural probes here, I need to implement that :-)

As a comparison, setting probe-stack to a random function name like __probe_stack outputs the following:

foo:                                    # @foo
	movl	$8008, %eax             # imm = 0x1F48
	callq	__probe_stack
	subq	%rax, %rsp
	movl	$1, 4792(%rsp)
	movl	(%rsp), %eax
	addq	$8008, %rsp             # imm = 0x1F48
	retq

which requires runtime support (to provide __stack_probe), and a function call overhead, while ideally just an extra sub %rsp would be needed.

serge-sans-paille marked 3 inline comments as done.Oct 10 2019, 2:09 AM
serge-sans-paille added inline comments.
llvm/lib/Target/X86/X86FrameLowering.cpp
427

Because BuildStackAdjustment has more callsites than just emitPrologue and we want to capture all stack manipulation.

435

Should we generate a loop if the stack frame is large?

That's an option. it's more complex to make looping compatible with free probes, but that's possible.

435

This algorithm needs some documentation

Done

serge-sans-paille edited the summary of this revision. (Show Details)

Added test case, statistics and refactor interactions with existing stack probing mechanism.

Some early stats: on the sqlite amalgamation [0], the free probe reuse allows to skip 123 out of the 474 probes needed during frame lowering.

[0] https://www.sqlite.org/download.html

xbolva00 added inline comments.Oct 10 2019, 12:06 PM
llvm/lib/Target/X86/X86FrameLowering.cpp
451

nit: llvm::any_of

efriedma added inline comments.Oct 10 2019, 3:38 PM
llvm/lib/Target/X86/X86FrameLowering.cpp
443

Each probe has to have an offset of at most PageSize bytes from the previous probe. If each probe is exactly PageSize bytes away from the previous probe, that's fine. But it looks like you don't enforce the distance between free probes correctly?

453

There are instructions that don't refer to any FI, but are still relevant. For example, calls.

tstellar added inline comments.Oct 10 2019, 4:05 PM
clang/test/CodeGen/stack-clash-protection.c
4

There were concerns[1] raised recently about adding clang tests that were codegen dependent. Is something being tested here that can't be tested with an IR test? If you only need to test that the frontend option work, I think checking the IR for the necessary function attributes might be enough.

[1] http://lists.llvm.org/pipermail/cfe-dev/2019-September/063309.html

Ensure the distance between two probes is at max PAGE_SIZE.
Use Calls as free probes.
Fix alignment for dynamic alloca

This passes the llvm-test suite, and thanks to the use of calls, no inserted probe are needed to compile sqlite!

rnk added a comment.Oct 14 2019, 3:25 PM

For maintenance reasons, I'd really prefer it if we could find a way to reuse the existing code that calls an external stack probe function. What do you think about taking a look at X86RetpolineThunks.cpp and doing something similar to that? Basically, when the user sets -fstack-clash-protection, LLVM will emit a small comdat+weak function into every object file that has the same ABI as the existing stack probe mechanism. For other prior art, you can also look at how __clang_call_terminate works.

llvm/lib/Target/X86/X86FrameLowering.cpp
373

Please no dynamically initialized globals. The LLVM-y thing would be to use a switch.

427

If you care about those cases, we should have tests for all of them. These are all the cases:

  1. TailCallReturnAddrDelta: When doing guaranteed TCO for a callee with >4K of argument memory
  2. eliminateCallFramePseudoInstr: When doing stack adjustments to set up a call with more than 4K of arguments

Both of these cases involve setting up arguments to calls, and I think we can guarantee that the compiler will emit writes to every argument stack slot. So, to set up one of these cases, we'd have this code:

struct EightK { int a[2048]; };
void passEightK(EightK);
void foo() {
  EightK x;
  memset(&x, 0, sizeof(x));
  passEightK(x); // some targets will pass indirect, 32-bit probably uses byval.
}

In this case, LLVM happens to use rep;movsl to copy the argument memory. It could use memcpy, but either way, it will probe all those bytes.

I think that means it's safe to move to emitSPUpdate. I would greatly prefer that BuildStackAdjustment remain a simple primitive that generates a single instruction.

llvm/test/CodeGen/X86/stack-clash-dynamic-alloca.ll
10–12

This seems like a good use case for update_llc_test_checks.py. I'd want to see the body of the loop, the cmp, the jl, etc.

Some benchmarks, running the python performance suite from recompiled cpython build, one built with -O2 (baseline) and one built with -O2 -fstack-clash-protection (protection). Surprisingly enough, the stack protection makes the code faster in various scenario, but none of these changes present significant regression, according to me.

baseline.json
=============

Performance version: 0.9.1
Report on Linux-3.10.0-891.el7.x86_64-x86_64-with-redhat-7.5-Maipo
Number of logical CPUs: 8
Start date: 2019-10-15 11:16:28.425344
End date: 2019-10-15 11:37:55.064599

protection.json
===============

Performance version: 0.9.1
Report on Linux-3.10.0-891.el7.x86_64-x86_64-with-redhat-7.5-Maipo
Number of logical CPUs: 8
Start date: 2019-10-15 10:55:24.270166
End date: 2019-10-15 11:16:27.195366

+-------------------------+---------------+-----------------+--------------+-----------------------+
| Benchmark               | baseline.json | protection.json | Change       | Significance          |
+=========================+===============+=================+==============+=======================+
| 2to3                    | 435 ms        | 448 ms          | 1.03x slower | Significant (t=-4.26) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| chameleon               | 15.0 ms       | 14.7 ms         | 1.02x faster | Significant (t=4.76)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| chaos                   | 176 ms        | 174 ms          | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| crypto_pyaes            | 153 ms        | 150 ms          | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| deltablue               | 11.9 ms       | 11.9 ms         | 1.00x slower | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| django_template         | 210 ms        | 223 ms          | 1.06x slower | Significant (t=-3.84) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| dulwich_log             | 96.1 ms       | 102 ms          | 1.06x slower | Significant (t=-8.42) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| fannkuch                | 703 ms        | 698 ms          | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| float                   | 161 ms        | 160 ms          | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| genshi_text             | 45.5 ms       | 45.5 ms         | 1.00x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| genshi_xml              | 95.3 ms       | 95.2 ms         | 1.00x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| go                      | 392 ms        | 382 ms          | 1.03x faster | Significant (t=5.95)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| hexiom                  | 16.0 ms       | 15.9 ms         | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| html5lib                | 137 ms        | 135 ms          | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| json_dumps              | 18.8 ms       | 17.9 ms         | 1.05x faster | Significant (t=4.81)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| json_loads              | 41.3 us       | 40.2 us         | 1.03x faster | Significant (t=2.53)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| logging_format          | 17.4 us       | 17.5 us         | 1.00x slower | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| logging_silent          | 509 ns        | 500 ns          | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| logging_simple          | 14.3 us       | 14.8 us         | 1.04x slower | Significant (t=-5.64) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| mako                    | 30.6 ms       | 30.4 ms         | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| meteor_contest          | 134 ms        | 130 ms          | 1.03x faster | Significant (t=4.07)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| nbody                   | 186 ms        | 180 ms          | 1.03x faster | Significant (t=4.87)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| nqueens                 | 152 ms        | 145 ms          | 1.05x faster | Significant (t=5.26)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| pathlib                 | 29.5 ms       | 27.7 ms         | 1.07x faster | Significant (t=4.33)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| pickle                  | 15.7 us       | 15.6 us         | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| pickle_dict             | 45.1 us       | 43.2 us         | 1.04x faster | Significant (t=34.35) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| pickle_list             | 6.05 us       | 5.67 us         | 1.07x faster | Significant (t=5.15)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| pickle_pure_python      | 766 us        | 786 us          | 1.03x slower | Significant (t=-3.19) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| pidigits                | 206 ms        | 208 ms          | 1.01x slower | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| python_startup          | 12.4 ms       | 12.7 ms         | 1.03x slower | Significant (t=-8.03) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| python_startup_no_site  | 7.62 ms       | 7.75 ms         | 1.02x slower | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| raytrace                | 830 ms        | 805 ms          | 1.03x faster | Significant (t=15.48) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| regex_compile           | 264 ms        | 261 ms          | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| regex_dna               | 195 ms        | 195 ms          | 1.00x slower | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| regex_effbot            | 3.59 ms       | 3.50 ms         | 1.02x faster | Significant (t=7.90)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| regex_v8                | 30.0 ms       | 29.3 ms         | 1.02x faster | Significant (t=7.10)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| richards                | 122 ms        | 120 ms          | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| scimark_fft             | 466 ms        | 462 ms          | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| scimark_lu              | 358 ms        | 334 ms          | 1.07x faster | Significant (t=5.46)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| scimark_monte_carlo     | 167 ms        | 155 ms          | 1.08x faster | Significant (t=4.68)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| scimark_sor             | 346 ms        | 328 ms          | 1.05x faster | Significant (t=7.55)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| scimark_sparse_mat_mult | 5.49 ms       | 5.14 ms         | 1.07x faster | Significant (t=13.23) |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| spectral_norm           | 209 ms        | 199 ms          | 1.05x faster | Significant (t=4.10)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| sympy_expand            | 650 ms        | 638 ms          | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| sympy_integrate         | 30.8 ms       | 30.2 ms         | 1.02x faster | Significant (t=9.94)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| sympy_str               | 425 ms        | 407 ms          | 1.04x faster | Significant (t=4.26)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| sympy_sum               | 248 ms        | 248 ms          | 1.00x slower | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| telco                   | 9.63 ms       | 9.37 ms         | 1.03x faster | Significant (t=9.26)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| tornado_http            | 371 ms        | 369 ms          | 1.00x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| unpack_sequence         | 60.1 ns       | 59.0 ns         | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| unpickle                | 19.3 us       | 18.9 us         | 1.02x faster | Significant (t=8.65)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| unpickle_list           | 4.72 us       | 4.67 us         | 1.01x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| unpickle_pure_python    | 576 us        | 562 us          | 1.02x faster | Significant (t=7.27)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| xml_etree_generate      | 156 ms        | 153 ms          | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| xml_etree_iterparse     | 157 ms        | 146 ms          | 1.08x faster | Significant (t=6.27)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| xml_etree_parse         | 187 ms        | 183 ms          | 1.02x faster | Significant (t=4.14)  |
+-------------------------+---------------+-----------------+--------------+-----------------------+
| xml_etree_process       | 126 ms        | 123 ms          | 1.02x faster | Not significant       |
+-------------------------+---------------+-----------------+--------------+-----------------------+

For maintenance reasons, I'd really prefer it if we could find a way to reuse the existing code that calls an external stack probe function. What do you think about taking a look at X86RetpolineThunks.cpp and doing something similar to that? Basically, when the user sets -fstack-clash-protection, LLVM will emit a small comdat+weak function into every object file that has the same ABI as the existing stack probe mechanism. For other prior art, you can also look at how __clang_call_terminate works.

@rnk I 100% understand the reasoning, and will have a look to the files you're pointing at. AFAIU the approach you're suggesting is incompatible with the « free probe » algorithm that moves instruction in between stack allocations, but let me confirm that first.

Another test run on an x264 encoder (source: https://openbenchmarking.org/test/pts/x264)

Compiled with -O2 and with or without -fstack-clash-protection; Run without threads (x265 --pools 1 -F 1 ./Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m /dev/null)

Clang

baseline:  318.60s 
protection:  317.72s

So no performance impact beyond noise.

The compilation inserts 44 inline probes in 9 functions.

gcc

Out of comparison, with gcc 8.2 (yeah, it's a bit old), I get (same flags & setup)

baseline : 417.53 sec
protected : 412.6 sec

The compilations inserts inline probes in 22 functions.
So gcc inserts more probes, the impact on performance is equally surprising. I need to gather more points to get some statistical informations there.

Get rid of static mapping + update test cases

serge-sans-paille marked 6 inline comments as done.Oct 15 2019, 8:37 AM
dmajor added a subscriber: dmajor.Oct 16 2019, 10:44 AM

Moved the implementation to a specialization of emitStackProbeInline that used to be Windows-centric. Provide a generic implementation that generates inline assembly instead. that way we don't clutter existing functions, and only add a new case to emitSPUpdate.

Handle large stack allocation with a loop instead of a (large) basic block.

@rnk: this is far less intrusive and more integrated to the existing structure, thanks a lot for hinting in that direction.

Extra note: an older version of the patch has been tested by the firefox team without much performance impact, (and no test failure), see https://bugzilla.mozilla.org/show_bug.cgi?id=1588710

Explicilty prefer existing mechanism for windows.
Split loop/block allocation strategy to different functions.

You should also probably document the arg in
clang/docs/ClangCommandLineReference.rst
we have -fstack-protector-strong & co in the doc

Better integration with MachineInstr description
Handle stacks with multiple stack objects
More test case

Update documentation as suggested by @sylvestre.ledru
Corner case when a write was touching memory beyond the allocated stack.

Some benchmark / instrumentation report: due to the way memory moves are ordered in the entry block, there tend to be relatively few free probes between two stack growth within a function, and a large number after the last stack growth.

When recompiling llc, I only get 18 free probes between stack growth, and a large amount (several thousands) after the last stack growth.
When recompiling cpython, I only get 4 free probes between stack growth.

This may be improved by rescheduling instructions / changing stack object layout but that's probably not worth the effort (there's not that much functions with stack space > a page)

Temporarily comment out call support as free probe, everything else passes validation but a call may have some stack effect I don't handle (yet).

clang/docs/ClangCommandLineReference.rst
1916

Maybe add that it is Linux only? :)

serge-sans-paille edited the summary of this revision. (Show Details)

Moved to a simpler, easier to review, less error prone implementation. Validates just fines on llvm bootstrap an cpython code base. @sylvestre.ledru that's the patch you can play with.

@rnk what's your take on that version?

@sylvestre.ledru did the testing and benchmarking on firefox (see https://bugzilla.mozilla.org/show_bug.cgi?id=1588710#c12), everything seems ok, let's move forward?

@sylvestre.ledru did the testing and benchmarking on firefox (see https://bugzilla.mozilla.org/show_bug.cgi?id=1588710#c12), everything seems ok, let's move forward?

Actually, not sure about the benchmark anymore. I am checking with experts. Sorry

Added extra test for arch support (warns if the flag is used on unsupported targets, and checks it's effectively unused), cc @sylvestre.ledru

No, still waiting for @rnk feedback. This commit is just me using the wrong remote when pushing for validation.

For the record : this validates fine on OSX, Linux and Windows, using the github actions setup by @tstellar

https://github.com/serge-sans-paille/llvm-project/pull/3

@rnk up? the mozilla folks confirmed there's no significant regression (see https://bugzilla.mozilla.org/show_bug.cgi?id=1588710#c12).

craig.topper added inline comments.Jan 6 2020, 10:48 AM
clang/docs/ClangCommandLineReference.rst
1914

Probably need a -fno-stack-class-protection as well. Looks like gcc has it. You'll need to update the handling in the driver to make sure the last polarity wins. For cc1 you might be able to support just the positive variant. But try to see what we usually do.

clang/lib/Basic/Targets/X86.h
169

What about 32-bit mode where the register name is "esp"?

clang/lib/CodeGen/CGStmt.cpp
2255

Why is this in the frontend diagnostic list?

clang/lib/Driver/ToolChains/Clang.cpp
2999

Can we use EffectiveTriple.isX86() that was just introduced yesterday?

3008

Seems like this should just be an if? Or maybe use Args.filtered or args getLastArg?

tstellar added inline comments.Jan 9 2020, 1:05 PM
clang/lib/Driver/ToolChains/Clang.cpp
3001

Do we need to warn (or error) here for arches that don't implement -fstack-clash-protection?

Take review into account.

serge-sans-paille marked 9 inline comments as done.Jan 10 2020, 6:03 AM
serge-sans-paille added inline comments.
clang/lib/CodeGen/CGStmt.cpp
2255

As far as I understand, that's the only place where line/col accurate diagnostics are emitted.

clang/lib/Driver/ToolChains/Clang.cpp
3001

That's done indirectly because the argument is not claimed on unsupported platform.

@craig.topper do you think there is a chance that this change could be part of clang-10 ?
Thanks

craig.topper added inline comments.Jan 10 2020, 8:45 AM
clang/lib/CodeGen/CGStmt.cpp
2255

I was really asking about why we have to use FrontendDiagnostic.h and DriverDiagnostic.h? Is that what we normally use for diagnostics in Codegen files? Seems like a possible layering issue for modules build

serge-sans-paille marked 2 inline comments as done.Jan 10 2020, 9:50 AM
serge-sans-paille added inline comments.
clang/lib/CodeGen/CGStmt.cpp
2255

Oh, I understand now, let me have a look, your remark totally makes sense.

  • update warn_fe_stack_clash_protection_inline_asm location.
serge-sans-paille marked an inline comment as done.Jan 11 2020, 12:09 AM
craig.topper added inline comments.Jan 16 2020, 11:03 AM
clang/include/clang/Basic/DiagnosticCommonKinds.td
243

Remove "fe_" from this? Looks like thats only there because this used to be DiagnosticFrontendKinds?

clang/lib/Basic/Targets/X86.h
169

This probably doesn't work correctly with X32 which is 64-bit but uses 32-bit pointers. Maybe we change this to "bool isSPReg(StringRef Reg)" and pass the register name to it so you can just check both strings always.

clang/lib/Driver/ToolChains/Clang.cpp
3002

I think this can use Args.hasFlag(options::OPT_fstack_clash_protection, options::OPT_fnostack_clash_protection, false) and then you don't need the 'if'

llvm/lib/Target/X86/X86ISelLowering.cpp
31173

This uses physSPReg, but doesn't match the condition for when physSPReg is a 64-bit register. Same at several places below.

llvm/lib/Target/X86/X86InstrCompiler.td
130

Why is this In64BitMode, but above is NotLP64. Shouldn't they be opposites? Looks like this was just copied from SEG_ALLOCA above?

serge-sans-paille marked 5 inline comments as done.

Take into account some of the reviews

serge-sans-paille marked 2 inline comments as done.Jan 17 2020, 5:44 AM
serge-sans-paille added inline comments.
clang/lib/Driver/ToolChains/Clang.cpp
3002

I wasn't aware of that API, thanks for pointing this out o/

llvm/lib/Target/X86/X86ISelLowering.cpp
31173

Sorry, I don't understand your remark. Can you elaborate?

llvm/lib/Target/X86/X86InstrCompiler.td
130

Yeah, these intrinsics are just the same as the one above, except they have a different name. And they get lowered differently. I'm not familiar with that part of LLVM, can you suggest a better approach?

Unit tests: fail. 61960 tests passed, 3 failed and 783 were skipped.

failed: Clang.CodeGen/stack-clash-protection.c
failed: Clang.Driver/stack-clash-protection.c
failed: Clang.Misc/warning-flags.c

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

craig.topper added inline comments.Jan 17 2020, 10:31 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
31173

If physSPReg is RSP then you need to use SUB64ri32 and if its ESP you need to use SUB32ri. The condition you're using here is "IsLP64", but the condition for phySPReg to be RSP is "IsLP64 || Subtarget.isTargetNaCl64()". So there's a mismatch when IsLP64 is false and Subtarget.isTargetNaCl64() is true.

  • Harmonize esp/rsp usage as hinted by @craig.topper
  • Fix argument detection

Unit tests: fail. 62035 tests passed, 2 failed and 783 were skipped.

failed: Clang.CodeGen/stack-clash-protection.c
failed: Clang.Misc/warning-flags.c

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Update warning category

Unit tests: fail. 62059 tests passed, 1 failed and 783 were skipped.

failed: Clang.CodeGen/stack-clash-protection.c

clang-tidy: unknown.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Make some tests more portable.

Unit tests: pass. 62269 tests passed, 0 failed and 827 were skipped.

clang-tidy: fail. clang-tidy found 0 errors and 16 warnings. 0 of them are added as review comments below (why?).

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Pre-merge checks is in beta. Report issue. Please join beta or enable it for your project.

This revision is now accepted and ready to land.Feb 6 2020, 10:02 AM

Any plans to merge this feature to 10 release?

Any plans to merge this feature to 10 release?

I would love to have it in -10 for Firefox !

I would love too, @hans what do you think?

This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Feb 7 2020, 11:08 AM

This breaks check-clang on mac and win: http://45.33.8.238/mac/7485/step_7.txt http://45.33.8.238/win/7753/step_7.txt

Please revert and then investigate asynchronously, unless the fix is obvious.

clang/test/CodeGen/stack-clash-protection.c
3

Tests should compile binaries and then run them if at all possible. This is impossible in cross-build scenarios and in general we try hard to have unit tests, not integration tests. Check that this produces the IR you expect, and have an llvm-level test that the IR produces the assembly you expect, and if you must, an lld level test that checks that the generated elf file looks like you think.

Reverted in b03c3d8c620 for now.

Also got failed clang tests on Arm/Aarch64 builders

http://lab.llvm.org:8011/builders/llvm-clang-win-x-armv7l/builds/4295
http://lab.llvm.org:8011/builders/llvm-clang-win-x-aarch64/builds/4277

  • FAIL: Clang::stack-clash-protection.c
  • FAIL: Clang::stack-clash-protection.c
error: unable to create target: 'No available targets are compatible with triple "x86_64"'

looks like REQUIRES: x86-registered-target keyword is needed.

For the record: tests have been updated, option handling cleaned up and expansive check failure fixed in commit e67cbac81211d40332a79d98c9d5953624cc1202.

sberg added a subscriber: sberg.Aug 10 2020, 12:13 PM
sberg added inline comments.
clang/include/clang/Driver/Options.td
1775

Should this rather spell "fno-stack-clash-protection"? The above change to clang/docs/ClangCommandLineReference.rst (but which got overwritten by https://github.com/llvm/llvm-project/commit/9624beb38a46037f69362650a52e06d8be4fd006 "[docs] Regenerate ClangCommandLineReference.rst") mentions -fno-stack-clash-protection, and also GCC calls it like that. (Though the below clang/test/Driver/stack-clash-protection.c does use -fnostack-clash-protection.)

xbolva00 added inline comments.Aug 10 2020, 12:18 PM
clang/include/clang/Driver/Options.td
1775

And the fix should be backported to release branch.

@hans

sberg added a comment.Aug 12 2020, 9:10 AM

I filed https://bugs.llvm.org/show_bug.cgi?id=47139 "Misspelled -fnostack-clash-protection" now.

kees reopened this revision.Aug 27 2020, 4:31 PM
kees added a subscriber: kees.

Sorry if I missed something here, but why is this marked as "Closed"? It seems like the feature has still not landed (i.e. it got reverted).

This revision is now accepted and ready to land.Aug 27 2020, 4:32 PM
efriedma closed this revision.Aug 27 2020, 4:38 PM

It was reverted, then re-landed. It's in trunk now.

kees added a comment.Aug 28 2020, 12:23 AM

Ah! Yes, I see it now. Thanks and sorry for the noise!

emaste added a subscriber: emaste.Nov 24 2020, 5:34 PM
emaste added inline comments.
clang/lib/Driver/ToolChains/Clang.cpp
2996–2997

Is there anything OS-dependent here?

I plan to add EffectiveTriple.isOSFreeBSD() to FreeBSD-s in-tree Clang
https://reviews.freebsd.org/D27366
and intend to upstream the change (with a test), but should we just remove the OS test completely?

clang/lib/Driver/ToolChains/Clang.cpp
2996–2997

According to https://blog.qualys.com/vulnerabilities-research/2017/06/19/the-stack-clash, stack clash impacts

Linux, OpenBSD, NetBSD, FreeBSD and Solaris

We should probably enlarge the test to these platforms.