Index: llvm/trunk/docs/SpeculativeLoadHardening.md =================================================================== --- llvm/trunk/docs/SpeculativeLoadHardening.md +++ llvm/trunk/docs/SpeculativeLoadHardening.md @@ -0,0 +1,1099 @@ +# Speculative Load Hardening + +### A Spectre Variant #1 Mitigation Technique + +Author: Chandler Carruth - [chandlerc@google.com](mailto:chandlerc@google.com) + +## Problem Statement + +Recently, Google Project Zero and other researchers have found information leak +vulnerabilities by exploiting speculative execution in modern CPUs. These +exploits are currently broken down into three variants: +* GPZ Variant #1 (a.k.a. Spectre Variant #1): Bounds check (or predicate) bypass +* GPZ Variant #2 (a.k.a. Spectre Variant #2): Branch target injection +* GPZ Variant #3 (a.k.a. Meltdown): Rogue data cache load + +For more details, see the Google Project Zero blog post and the Spectre research +paper: +* https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html +* https://spectreattack.com/spectre.pdf + +The core problem of GPZ Variant #1 is that speculative execution uses branch +prediction to select the path of instructions speculatively executed. This path +is speculatively executed with the available data, and may load from memory and +leak the loaded values through various side channels that survive even when the +speculative execution is unwound due to being incorrect. Mispredicted paths can +cause code to be executed with data inputs that never occur in correct +executions, making checks against malicious inputs ineffective and allowing +attackers to use malicious data inputs to leak secret data. Here is an example, +extracted and simplified from the Project Zero paper: +``` +struct array { + unsigned long length; + unsigned char data[]; +}; +struct array *arr1 = ...; // small array +struct array *arr2 = ...; // array of size 0x400 +unsigned long untrusted_offset_from_caller = ...; +if (untrusted_offset_from_caller < arr1->length) { + unsigned char value = arr1->data[untrusted_offset_from_caller]; + unsigned long index2 = ((value&1)*0x100)+0x200; + unsigned char value2 = arr2->data[index2]; +} +``` + +The key of the attack is to call this with `untrusted_offset_from_caller` that +is far outside of the bounds when the branch predictor will predict that it +will be in-bounds. In that case, the body of the `if` will be executed +speculatively, and may read secret data into `value` and leak it via a +cache-timing side channel when a dependent access is made to populate `value2`. + +## High Level Mitigation Approach + +While several approaches are being actively pursued to mitigate specific +branches and/or loads inside especially risky software (most notably various OS +kernels), these approaches require manual and/or static analysis aided auditing +of code and explicit source changes to apply the mitigation. They are unlikely +to scale well to large applications. We are proposing a comprehensive +mitigation approach that would apply automatically across an entire program +rather than through manual changes to the code. While this is likely to have a +high performance cost, some applications may be in a good position to take this +performance / security tradeoff. + +The specific technique we propose is to cause loads to be checked using +branchless code to ensure that they are executing along a valid control flow +path. 
Consider the following C-pseudo-code representing the core idea of a
predicate guarding potentially invalid loads:
```
void leak(int data);
void example(int* pointer1, int* pointer2) {
  if (condition) {
    // ... lots of code ...
    leak(*pointer1);
  } else {
    // ... more code ...
    leak(*pointer2);
  }
}
```

This would get transformed into something resembling the following:
```
uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
uintptr_t all_zeros_mask = 0;
void leak(int data);
void example(int* pointer1, int* pointer2) {
  uintptr_t predicate_state = all_ones_mask;
  if (condition) {
    // Assuming ?: is implemented using branchless logic...
    predicate_state = !condition ? all_zeros_mask : predicate_state;
    // ... lots of code ...
    //
    // Harden the pointer so it can't be loaded
    pointer1 &= predicate_state;
    leak(*pointer1);
  } else {
    predicate_state = condition ? all_zeros_mask : predicate_state;
    // ... more code ...
    //
    // Alternative: Harden the loaded value
    int value2 = *pointer2 & predicate_state;
    leak(value2);
  }
}
```

The result should be that if the `if (condition) {` branch is mis-predicted,
there is a *data* dependency on the condition used to zero out any pointers
prior to loading through them or to zero out all of the loaded bits. Even
though this code pattern may still execute speculatively, *invalid* speculative
executions are prevented from leaking secret data from memory (but note that
this data might still be loaded in safe ways, and some regions of memory are
required to not hold secrets; see below for detailed limitations). This
approach only requires that the underlying hardware have a way to implement a
branchless and unpredicted conditional update of a register's value. All modern
architectures have support for this, and in fact such support is necessary to
correctly implement constant-time cryptographic primitives.

Crucial properties of this approach:
* It does not try to prevent any particular side-channel from working. This is
  important as there are an unknown number of potential side channels and we
  expect to continue discovering more. Instead, it prevents the observation of
  secret data in the first place.
* It accumulates the predicate state, protecting even in the face of nested
  *correctly* predicted control flows.
* It passes this predicate state across function boundaries to provide
  [interprocedural protection](#interprocedural-checking).
* When hardening the address of a load, it uses a *destructive* or
  *non-reversible* modification of the address to prevent an attacker from
  reversing the check using attacker-controlled inputs.
* It does not completely block speculative execution, and merely prevents
  *mis*-speculated paths from leaking secrets from memory (and stalls
  speculation until this can be determined).
* It is completely general and makes no fundamental assumptions about the
  underlying architecture other than the ability to do branchless conditional
  data updates and a lack of value prediction.
* It does not require programmers to identify all possible secret data using
  static source code annotations or code vulnerable to a variant #1 style
  attack.

Limitations of this approach:
* It requires re-compiling source code to insert hardening instruction
  sequences. Only software compiled in this mode is protected.
* The performance is heavily dependent on a particular architecture's
  implementation strategy.
  We outline a potential x86 implementation below and
  characterize its performance.
* It does not defend against secret data already loaded from memory and
  residing in registers or leaked through other side-channels in
  non-speculative execution. Code dealing with this, e.g., cryptographic
  routines, already uses constant-time algorithms and code to prevent
  side-channels. Such code should also scrub registers of secret data following
  [these
  guidelines](https://github.com/HACS-workshop/spectre-mitigations/blob/master/crypto_guidelines.md).
* To achieve reasonable performance, many loads may not be checked, such as
  those with compile-time fixed addresses. This primarily consists of accesses
  at compile-time constant offsets of global and local variables. Code which
  needs this protection and intentionally stores secret data must ensure the
  memory regions used for secret data are necessarily dynamic mappings or heap
  allocations. This is an area which can be tuned to provide more comprehensive
  protection at the cost of performance.
* [Hardened loads](#hardening-the-address-of-the-load) may still load data from
  _valid_ addresses, just not from _attacker-controlled_ addresses. To prevent
  these from reading secret data, the low 2gb of the address space and 2gb
  above and below any executable pages should be protected.

Credit:
* The core idea of tracing misspeculation through data and marking pointers to
  block misspeculated loads was developed as part of a HACS 2018 discussion
  between Chandler Carruth, Paul Kocher, Thomas Pornin, and several other
  individuals.
* Core idea of masking out loaded bits was part of the original mitigation
  suggested by Jann Horn when these attacks were reported.


### Indirect Branches, Calls, and Returns

It is possible to attack control flow other than conditional branches with
variant #1 style mispredictions.
* A prediction towards a hot call target of a virtual method can lead to it
  being speculatively executed when an expected type is used (often called
  "type confusion").
* A hot case may be speculatively executed due to prediction instead of the
  correct case for a switch statement implemented as a jump table.
* A hot common return address may be predicted incorrectly when returning from
  a function.

These code patterns are also vulnerable to Spectre variant #2, and as such are
best mitigated with a
[retpoline](https://support.google.com/faqs/answer/7625886) on x86 platforms.
When a mitigation technique like retpoline is used, speculation simply cannot
proceed through an indirect control flow edge (or it cannot be mispredicted in
the case of a filled RSB) and so it is also protected from variant #1 style
attacks. However, some architectures, micro-architectures, or vendors do not
employ the retpoline mitigation, and on future x86 hardware (both Intel and
AMD) it is expected to become unnecessary due to hardware-based mitigation.

When not using a retpoline, these edges will need independent protection from
variant #1 style attacks. The analogous approach to that used for conditional
control flow should work:
```
uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
uintptr_t all_zeros_mask = 0;
void leak(int data);
void example(int* pointer1, int* pointer2) {
  uintptr_t predicate_state = all_ones_mask;
  switch (condition) {
  case 0:
    // Assuming ?: is implemented using branchless logic...
    predicate_state = (condition != 0) ? all_zeros_mask : predicate_state;
    // ... lots of code ...
+ // + // Harden the pointer so it can't be loaded + pointer1 &= predicate_state; + leak(*pointer1); + break; + + case 1: + predicate_state = (condition != 1) ? all_zeros_mask : predicate_state; + // ... more code ... + // + // Alternative: Harden the loaded value + int value2 = *pointer2 & predicate_state; + leak(value2); + break; + + // ... + } +} +``` + +The core idea remains the same: validate the control flow using data-flow and +use that validation to check that loads cannot leak information along +misspeculated paths. Typically this involves passing the desired target of such +control flow across the edge and checking that it is correct afterwards. Note +that while it is tempting to think that this mitigates variant #2 attacks, it +does not. Those attacks go to arbitrary gadgets that don't include the checks. + + +### Variant #1.1 and #1.2 attacks: "Bounds Check Bypass Store" + +Beyond the core variant #1 attack, there are techniques to extend this attack. +The primary technique is known as "Bounds Check Bypass Store" and is discussed +in this research paper: https://people.csail.mit.edu/vlk/spectre11.pdf + +We will analyze these two variants independently. First, variant #1.1 works by +speculatively storing over the return address after a bounds check bypass. This +speculative store then ends up being used by the CPU during speculative +execution of the return, potentially directing speculative execution to +arbitrary gadgets in the binary. Let's look at an example. +``` +unsigned char local_buffer[4]; +unsigned char *untrusted_data_from_caller = ...; +unsigned long untrusted_size_from_caller = ...; +if (untrusted_size_from_caller < sizeof(local_buffer)) { + // Speculative execution enters here with a too-large size. + memcpy(local_buffer, untrusted_data_from_caller, + untrusted_size_from_caller); + // The stack has now been smashed, writing an attacker-controlled + // address over the return adress. + minor_processing(local_buffer); + return; + // Control will speculate to the attacker-written address. +} +``` + +However, this can be mitigated by hardening the load of the return address just +like any other load. This is sometimes complicated because x86 for example +*implicitly* loads the return address off the stack. However, the +implementation technique below is specifically designed to mitigate this +implicit load by using the stack pointer to communicate misspeculation between +functions. This additionally causes a misspeculation to have an invalid stack +pointer and never be able to read the speculatively stored return address. See +the detailed discussion below. + +For variant #1.2, the attacker speculatively stores into the vtable or jump +table used to implement an indirect call or indirect jump. Because this is +speculative, this will often be possible even when these are stored in +read-only pages. For example: +``` +class FancyObject : public BaseObject { +public: + void DoSomething() override; +}; +void f(unsigned long attacker_offset, unsigned long attacker_data) { + FancyObject object = getMyObject(); + unsigned long *arr[4] = getFourDataPointers(); + if (attacker_offset < 4) { + // We have bypassed the bounds check speculatively. + unsigned long *data = arr[attacker_offset]; + // Now we have computed a pointer inside of `object`, the vptr. + *data = attacker_data; + // The vptr points to the virtual table and we speculatively clobber that. + g(object); // Hand the object to some other routine. + } +} +// In another file, we call a method on the object. 
+void g(BaseObject &object) { + object.DoSomething(); + // This speculatively calls the address stored over the vtable. +} +``` + +Mitigating this requires hardening loads from these locations, or mitigating +the indirect call or indirect jump. Any of these are sufficient to block the +call or jump from using a speculatively stored value that has been read back. + +For both of these, using retpolines would be equally sufficient. One possible +hybrid approach is to use retpolines for indirect call and jump, while relying +on SLH to mitigate returns. + +Another approach that is sufficient for both of these is to harden all of the +speculative stores. However, as most stores aren't interesting and don't +inherently leak data, this is expected to be prohibitively expensive given the +attack it is defending against. + + +## Implementation Details + +There are a number of complex details impacting the implementation of this +technique, both on a particular architecture and within a particular compiler. +We discuss proposed implementation techniques for the x86 architecture and the +LLVM compiler. These are primarily to serve as an example, as other +implementation techniques are very possible. + + +### x86 Implementation Details + +On the x86 platform we break down the implementation into three core +components: accumulating the predicate state through the control flow graph, +checking the loads, and checking control transfers between procedures. + + +#### Accumulating Predicate State + +Consider baseline x86 instructions like the following, which test three +conditions and if all pass, loads data from memory and potentially leaks it +through some side channel: +``` +# %bb.0: # %entry + pushq %rax + testl %edi, %edi + jne .LBB0_4 +# %bb.1: # %then1 + testl %esi, %esi + jne .LBB0_4 +# %bb.2: # %then2 + testl %edx, %edx + je .LBB0_3 +.LBB0_4: # %exit + popq %rax + retq +.LBB0_3: # %danger + movl (%rcx), %edi + callq leak + popq %rax + retq +``` + +When we go to speculatively execute the load, we want to know whether any of +the dynamically executed predicates have been misspeculated. To track that, +along each conditional edge, we need to track the data which would allow that +edge to be taken. On x86, this data is stored in the flags register used by the +conditional jump instruction. Along both edges after this fork in control flow, +the flags register remains alive and contains data that we can use to build up +our accumulated predicate state. We accumulate it using the x86 conditional +move instruction which also reads the flag registers where the state resides. +These conditional move instructions are known to not be predicted on any x86 +processors, making them immune to misprediction that could reintroduce the +vulnerability. When we insert the conditional moves, the code ends up looking +like the following: +``` +# %bb.0: # %entry + pushq %rax + xorl %eax, %eax # Zero out initial predicate state. + movq $-1, %r8 # Put all-ones mask into a register. + testl %edi, %edi + jne .LBB0_1 +# %bb.2: # %then1 + cmovneq %r8, %rax # Conditionally update predicate state. + testl %esi, %esi + jne .LBB0_1 +# %bb.3: # %then2 + cmovneq %r8, %rax # Conditionally update predicate state. + testl %edx, %edx + je .LBB0_4 +.LBB0_1: + cmoveq %r8, %rax # Conditionally update predicate state. + popq %rax + retq +.LBB0_4: # %danger + cmovneq %r8, %rax # Conditionally update predicate state. + ... 
+``` + +Here we create the "empty" or "correct execution" predicate state by zeroing +`%rax`, and we create a constant "incorrect execution" predicate value by +putting `-1` into `%r8`. Then, along each edge coming out of a conditional +branch we do a conditional move that in a correct execution will be a no-op, +but if misspeculated, will replace the `%rax` with the value of `%r8`. +Misspeculating any one of the three predicates will cause `%rax` to hold the +"incorrect execution" value from `%r8` as we preserve incoming values when +execution is correct rather than overwriting it. + +We now have a value in `%rax` in each basic block that indicates if at some +point previously a predicate was mispredicted. And we have arranged for that +value to be particularly effective when used below to harden loads. + + +##### Indirect Call, Branch, and Return Predicates + +(Not yet implemented.) + +There is no analogous flag to use when tracing indirect calls, branches, and +returns. The predicate state must be accumulated through some other means. +Fundamentally, this is the reverse of the problem posed in CFI: we need to +check where we came from rather than where we are going. For function-local +jump tables, this is easily arranged by testing the input to the jump table +within each destination: +``` + pushq %rax + xorl %eax, %eax # Zero out initial predicate state. + movq $-1, %r8 # Put all-ones mask into a register. + jmpq *.LJTI0_0(,%rdi,8) # Indirect jump through table. +.LBB0_2: # %sw.bb + testq $0, %rdi # Validate index used for jump table. + cmovneq %r8, %rax # Conditionally update predicate state. + ... + jmp _Z4leaki # TAILCALL + +.LBB0_3: # %sw.bb1 + testq $1, %rdi # Validate index used for jump table. + cmovneq %r8, %rax # Conditionally update predicate state. + ... + jmp _Z4leaki # TAILCALL + +.LBB0_5: # %sw.bb10 + testq $2, %rdi # Validate index used for jump table. + cmovneq %r8, %rax # Conditionally update predicate state. + ... + jmp _Z4leaki # TAILCALL + ... + + .section .rodata,"a",@progbits + .p2align 3 +.LJTI0_0: + .quad .LBB0_2 + .quad .LBB0_3 + .quad .LBB0_5 + ... +``` + +Returns have a simple mitigation technique on x86-64 (or other ABIs which have +what is called a "red zone" region beyond the end of the stack). This region is +guaranteed to be preserved across interrupts and context switches, making the +return address used in returning to the current code remain on the stack and +valid to read. We can emit code in the caller to verify that a return edge was +not mispredicted: +``` + callq other_function +return_addr: + testq -8(%rsp), return_addr # Validate return address. + cmovneq %r8, %rax # Update predicate state. +``` + +For an ABI without a "red zone" (and thus unable to read the return address +from the stack), mitigating returns face similar problems to calls below. + +Indirect calls (and returns in the absence of a red zone ABI) pose the most +significant challenge to propagate. The simplest technique would be to define a +new ABI such that the intended call target is passed into the called function +and checked in the entry. Unfortunately, new ABIs are quite expensive to deploy +in C and C++. While the target function could be passed in TLS, we would still +require complex logic to handle a mixture of functions compiled with and +without this extra logic (essentially, making the ABI backwards compatible). +Currently, we suggest using retpolines here and will continue to investigate +ways of mitigating this. 
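
For illustration, the function-local jump table check above can also be written
in the C-pseudo-code style of the earlier examples. This is only a sketch of
the idea rather than the code the compiler emits; `expected_index` stands for
the constant index that selects a particular destination, and the
`all_zeros_mask` / `predicate_state` convention is the one introduced earlier:
```
uintptr_t all_zeros_mask = 0;
void leak(int data);
void jump_table_destination(uintptr_t predicate_state, unsigned long index,
                            unsigned long expected_index, int* pointer) {
  // Branchless: if the index that actually reached this destination is not
  // the one that selects it, we must have misspeculated, so poison the state.
  predicate_state =
      (index != expected_index) ? all_zeros_mask : predicate_state;
  // Loads in the destination are then hardened exactly as before.
  pointer = (int*)((uintptr_t)pointer & predicate_state);
  leak(*pointer);
}
```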
+ + +##### Optimizations, Alternatives, and Tradeoffs + +Merely accumulating predicate state involves significant cost. There are +several key optimizations we employ to minimize this and various alternatives +that present different tradeoffs in the generated code. + +First, we work to reduce the number of instructions used to track the state: +* Rather than inserting a `cmovCC` instruction along every conditional edge in + the original program, we track each set of condition flags we need to capture + prior to entering each basic block and reuse a common `cmovCC` sequence for + those. + * We could further reuse suffixes when there are multiple `cmovCC` + instructions required to capture the set of flags. Currently this is + believed to not be worth the cost as paired flags are relatively rare and + suffixes of them are exceedingly rare. +* A common pattern in x86 is to have multiple conditional jump instructions + that use the same flags but handle different conditions. Naively, we could + consider each fallthrough between them an "edge" but this causes a much more + complex control flow graph. Instead, we accumulate the set of conditions + necessary for fallthrough and use a sequence of `cmovCC` instructions in a + single fallthrough edge to track it. + +Second, we trade register pressure for simpler `cmovCC` instructions by +allocating a register for the "bad" state. We could read that value from memory +as part of the conditional move instruction, however, this creates more +micro-ops and requires the load-store unit to be involved. Currently, we place +the value into a virtual register and allow the register allocator to decide +when the register pressure is sufficient to make it worth spilling to memory +and reloading. + + +#### Hardening Loads + +Once we have the predicate accumulated into a special value for correct vs. +misspeculated, we need to apply this to loads in a way that ensures they do not +leak secret data. There are two primary techniques for this: we can either +harden the loaded value to prevent observation, or we can harden the address +itself to prevent the load from occuring. These have significantly different +performance tradeoffs. + + +##### Hardening loaded values + +The most appealing way to harden loads is to mask out all of the bits loaded. +The key requirement is that for each bit loaded, along the misspeculated path +that bit is always fixed at either 0 or 1 regardless of the value of the bit +loaded. The most obvious implementation uses either an `and` instruction with +an all-zero mask along misspeculated paths and an all-one mask along correct +paths, or an `or` instruction with an all-one mask along misspeculated paths +and an all-zero mask along correct paths. Other options become less appealing +such as multiplying by zero, or multiple shift instructions. For reasons we +elaborate on below, we end up suggesting you use `or` with an all-ones mask, +making the x86 instruction sequence look like the following: +``` + ... + +.LBB0_4: # %danger + cmovneq %r8, %rax # Conditionally update predicate state. + movl (%rsi), %edi # Load potentially secret data from %rsi. + orl %eax, %edi +``` + +Other useful patterns may be to fold the load into the `or` instruction itself +at the cost of a register-to-register copy. + +There are some challenges with deploying this approach: +1. Many loads on x86 are folded into other instructions. Separating them would + add very significant and costly register pressure with prohibitive + performance cost. +1. 
Loads may not target a general purpose register requiring extra instructions + to map the state value into the correct register class, and potentially more + expensive instructions to mask the value in some way. +1. The flags registers on x86 are very likely to be live, and challenging to + preserve cheaply. +1. There are many more values loaded than pointers & indices used for loads. As + a consequence, hardening the result of a load requires substantially more + instructions than hardening the address of the load (see below). + +Despite these challenges, hardening the result of the load critically allows +the load to proceed and thus has dramatically less impact on the total +speculative / out-of-order potential of the execution. There are also several +interesting techniques to try and mitigate these challenges and make hardening +the results of loads viable in at least some cases. However, we generally +expect to fall back when unprofitable from hardening the loaded value to the +next approach of hardening the address itself. + + +###### Loads folded into data-invariant operations can be hardened after the operation + +The first key to making this feasible is to recognize that many operations on +x86 are "data-invariant". That is, they have no (known) observable behavior +differences due to the particular input data. These instructions are often used +when implementing cryptographic primitives dealing with private key data +because they are not believed to provide any side-channels. Similarly, we can +defer hardening until after them as they will not in-and-of-themselves +introduce a speculative execution side-channel. This results in code sequences +that look like: +``` + ... + +.LBB0_4: # %danger + cmovneq %r8, %rax # Conditionally update predicate state. + addl (%rsi), %edi # Load and accumulate without leaking. + orl %eax, %edi +``` + +While an addition happens to the loaded (potentially secret) value, that +doesn't leak any data and we then immediately harden it. + + +###### Hardening of loaded values deferred down the data-invariant expression graph + +We can generalize the previous idea and sink the hardening down the expression +graph across as many data-invariant operations as desirable. This can use very +conservative rules for whether something is data-invariant. The primary goal +should be to handle multiple loads with a single hardening instruction: +``` + ... + +.LBB0_4: # %danger + cmovneq %r8, %rax # Conditionally update predicate state. + addl (%rsi), %edi # Load and accumulate without leaking. + addl 4(%rsi), %edi # Continue without leaking. + addl 8(%rsi), %edi + orl %eax, %edi # Mask out bits from all three loads. +``` + + +###### Preserving the flags while hardening loaded values on Haswell, Zen, and newer processors + +Sadly, there are no useful instructions on x86 that apply a mask to all 64 bits +without touching the flag registers. However, we can harden loaded values that +are narrower than a word (fewer than 32-bits on 32-bit systems and fewer than +64-bits on 64-bit systems) by zero-extending the value to the full word size +and then shifting right by at least the number of original bits using the BMI2 +`shrx` instruction: +``` + ... + +.LBB0_4: # %danger + cmovneq %r8, %rax # Conditionally update predicate state. + addl (%rsi), %edi # Load and accumulate 32 bits of data. + shrxq %rax, %rdi, %rdi # Shift out all 32 bits loaded. +``` + +Because on x86 the zero-extend is free, this can efficiently harden the loaded +value. 
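
The value-hardening idea, including deferring the mask down a data-invariant
expression graph, can also be written in the C-pseudo-code style of the earlier
examples. This is a sketch only, following the convention of the x86 sequences
above where `predicate_state` is zero along correct paths and all-ones once
misspeculation has been detected:
```
int load_and_harden(const int* pointer, uintptr_t predicate_state) {
  int value = *pointer;
  // OR-ing with an all-ones mask pins every loaded bit to 1 on misspeculated
  // paths so nothing about the value can leak; on correct paths the OR with
  // zero is a no-op.
  return value | (int)predicate_state;
}

int sum_three_hardened(const int* pointer, uintptr_t predicate_state) {
  // The additions are data-invariant, so the hardening can be deferred and a
  // single mask covers all three loads.
  int sum = pointer[0] + pointer[1] + pointer[2];
  return sum | (int)predicate_state;
}
```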


##### Hardening the address of the load

When hardening the loaded value is inapplicable, most often because the
instruction directly leaks information (like `cmp` or `jmpq`), we switch to
hardening the _address_ of the load instead of the loaded value. This avoids
increasing register pressure by unfolding the load or paying some other high
cost.

To understand how this works in practice, we need to examine the exact
semantics of the x86 addressing mode which, in its fully general form, looks
like `offset(%base,%index,scale)`. Here `%base` and `%index` are 64-bit
registers that can potentially be any value, and may be attacker controlled,
and `scale` and `offset` are fixed immediate values. `scale` must be `1`, `2`,
`4`, or `8`, and `offset` can be any 32-bit sign extended value. The exact
computation performed to find the address is then: `%base + (scale * %index) +
offset` under 64-bit 2's complement modular arithmetic.

One issue with this approach is that, after hardening, the `%base + (scale *
%index)` subexpression will compute a value near zero (`-1 + (scale * -1)`) and
then a large, positive `offset` will index into memory within the first two
gigabytes of address space. While these offsets are not attacker controlled,
the attacker could choose to attack a load which happens to have the desired
offset and then successfully read memory in that region. This significantly
raises the burden on the attacker and limits the scope of attack but does not
eliminate it. To fully close the attack we must work with the operating system
to preclude mapping memory in the low two gigabytes of address space.


###### 64-bit load checking instructions

We can use the following instruction sequences to check loads. We set up `%r8`
in these examples to hold the special value of `-1` which will be `cmov`ed over
`%rax` in misspeculated paths.

Single register addressing mode:
```
  ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        orq     %rax, %rsi              # Mask the pointer if misspeculating.
        movl    (%rsi), %edi
```

Two register addressing mode:
```
  ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        orq     %rax, %rsi              # Mask the pointer if misspeculating.
        orq     %rax, %rcx              # Mask the index if misspeculating.
        movl    (%rsi,%rcx), %edi
```

This will result in a negative address near zero or in `offset` wrapping the
address space back to a small positive address. Small, negative addresses will
fault in user-mode for most operating systems, but targets which need the high
address space to be user accessible may need to adjust the exact sequence used
above. Additionally, the low addresses will need to be marked unreadable by the
OS to fully harden the load.


###### RIP-relative addressing is even easier to break

There is a common addressing mode idiom that is substantially harder to check:
addressing relative to the instruction pointer. We cannot change the value of
the instruction pointer register and so we have the harder problem of forcing
`%base + scale * %index + offset` to be an invalid address, by *only* changing
`%index`. The only advantage we have is that the attacker also cannot modify
`%base`. If we use the fast instruction sequence above, but only apply it to
the index, we will always access `%rip + (scale * -1) + offset`. If the
attacker can find a load whose resulting address happens to point to secret
data, then they can reach it.
However, the loader and base libraries can also +simply refuse to map the heap, data segments, or stack within 2gb of any of the +text in the program, much like it can reserve the low 2gb of address space. + + +###### The flag registers again make everything hard + +Unfortunately, the technique of using `orq`-instructions has a serious flaw on +x86. The very thing that makes it easy to accumulate state, the flag registers +containing predicates, causes serious problems here because they may be alive +and used by the loading instruction or subsequent instructions. On x86, the +`orq` instruction **sets** the flags and will override anything already there. +This makes inserting them into the instruction stream very hazardous. +Unfortunately, unlike when hardening the loaded value, we have no fallback here +and so we must have a fully general approach available. + +The first thing we must do when generating these sequences is try to analyze +the surrounding code to prove that the flags are not in fact alive or being +used. Typically, it has been set by some other instruction which just happens +to set the flags register (much like ours!) with no actual dependency. In those +cases, it is safe to directly insert these instructions. Alternatively we may +be able to move them earlier to avoid clobbering the used value. + +However, this may ultimately be impossible. In that case, we need to preserve +the flags around these instructions: +``` + ... + +.LBB0_4: # %danger + cmovneq %r8, %rax # Conditionally update predicate state. + pushfq + orq %rax, %rcx # Mask the pointer if misspeculating. + orq %rax, %rdx # Mask the index if misspeculating. + popfq + movl (%rcx,%rdx), %edi +``` + +Using the `pushf` and `popf` instructions saves the flags register around our +inserted code, but comes at a high cost. First, we must store the flags to the +stack and reload them. Second, this causes the stack pointer to be adjusted +dynamically, requiring a frame pointer be used for referring to temporaries +spilled to the stack, etc. + +On newer x86 processors we can use the `lahf` and `sahf` instructions to save +all of the flags besides the overflow flag in a register rather than on the +stack. We can then use `seto` and `add` to save and restore the overflow flag +in a register. Combined, this will save and restore flags in the same manner as +above but using two registers rather than the stack. That is still very +expensive if slightly less expensive than `pushf` and `popf` in most cases. + + +###### A flag-less alternative on Haswell, Zen and newer processors + +Starting with the BMI2 x86 instruction set extensions available on Haswell and +Zen processors, there is an instruction for shifting that does not set any +flags: `shrx`. We can use this and the `lea` instruction to implement analogous +code sequences to the above ones. However, these are still very marginally +slower, as there are fewer ports able to dispatch shift instructions in most +modern x86 processors than there are for `or` instructions. + +Fast, single register addressing mode: +``` + ... + +.LBB0_4: # %danger + cmovneq %r8, %rax # Conditionally update predicate state. + shrxq %rax, %rsi, %rsi # Shift away bits if misspeculating. + movl (%rsi), %edi +``` + +This will collapse the register to zero or one, and everything but the offset +in the addressing mode to be less than or equal to 9. This means the full +address can only be guaranteed to be less than `(1 << 31) + 9`. 
The OS may wish +to protect an extra page of the low address space to account for this + + +##### Optimizations + +A very large portion of the cost for this approach comes from checking loads in +this way, so it is important to work to optimize this. However, beyond making +the instruction sequences to *apply* the checks efficient (for example by +avoiding `pushfq` and `popfq` sequences), the only significant optimization is +to check fewer loads without introducing a vulnerability. We apply several +techniques to accomplish that. + + +###### Don't check loads from compile-time constant stack offsets + +We implement this optimization on x86 by skipping the checking of loads which +use a fixed frame pointer offset. + +The result of this optimization is that patterns like reloading a spilled +register or accessing a global field don't get checked. This is a very +significant performance win. + + +###### Don't check dependent loads + +A core part of why this mitigation strategy works is that it establishes a +data-flow check on the loaded address. However, this means that if the address +itself was already loaded using a checked load, there is no need to check a +dependent load provided it is within the same basic block as the checked load, +and therefore has no additional predicates guarding it. Consider code like the +following: +``` + ... + +.LBB0_4: # %danger + movq (%rcx), %rdi + movl (%rdi), %edx +``` + +This will get transformed into: +``` + ... + +.LBB0_4: # %danger + cmovneq %r8, %rax # Conditionally update predicate state. + orq %rax, %rcx # Mask the pointer if misspeculating. + movq (%rcx), %rdi # Hardened load. + movl (%rdi), %edx # Unhardened load due to dependent addr. +``` + +This doesn't check the load through `%rdi` as that pointer is dependent on a +checked load already. + + +###### Protect large, load-heavy blocks with a single lfence + +It may be worth using a single `lfence` instruction at the start of a block +which begins with a (very) large number of loads that require independent +protection *and* which require hardening the address of the load. However, this +is unlikely to be profitable in practice. The latency hit of the hardening +would need to exceed that of an `lfence` when *correctly* speculatively +executed. But in that case, the `lfence` cost is a complete loss of speculative +execution (at a minimum). So far, the evidence we have of the performance cost +of using `lfence` indicates few if any hot code patterns where this trade off +would make sense. + + +###### Tempting optimizations that break the security model + +Several optimizations were considered which didn't pan out due to failure to +uphold the security model. One in particular is worth discussing as many others +will reduce to it. + +We wondered whether only the *first* load in a basic block could be checked. If +the check works as intended, it forms an invalid pointer that doesn't even +virtual-address translate in the hardware. It should fault very early on in its +processing. Maybe that would stop things in time for the misspeculated path to +fail to leak any secrets. This doesn't end up working because the processor is +fundamentally out-of-order, even in its speculative domain. As a consequence, +the attacker could cause the initial address computation itself to stall and +allow an arbitrary number of unrelated loads (including attacked loads of +secret data) to pass through. 
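
To illustrate this failure mode in the C-pseudo-code style of the earlier
examples (a sketch only, using the x86 convention where `predicate_state` is
zero on correct paths and all-ones on misspeculated paths):
```
int example(int* slow_pointer, int* attacked_pointer,
            uintptr_t predicate_state) {
  // If only this first load were checked, an attacker could arrange for
  // `slow_pointer`'s address computation (and so the eventual fault) to
  // stall...
  slow_pointer = (int*)((uintptr_t)slow_pointer | predicate_state);
  int a = *slow_pointer;
  // ...while this unrelated, unchecked load executes out of order and leaks
  // secret data before the misspeculation is ever resolved. Hence every
  // independently addressed load must be hardened.
  int b = *attacked_pointer;
  return a + b;
}
```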


#### Interprocedural Checking

Modern x86 processors may speculate into called functions and out of functions
to their return address. As a consequence, we need a way to check loads that
occur after a misspeculated predicate but where the load and the misspeculated
predicate are in different functions. In essence, we need some interprocedural
generalization of the predicate state tracking. A primary challenge to passing
the predicate state between functions is that we would like to not require a
change to the ABI or calling convention in order to make this mitigation more
deployable, and further would like code mitigated in this way to be easily
mixed with code not mitigated in this way and without completely losing the
value of the mitigation.


##### Embed the predicate state into the high bit(s) of the stack pointer

We can use the same technique that allows hardening pointers to pass the
predicate state into and out of functions. The stack pointer is trivially
passed between functions and we can test for it having the high bits set to
detect when it has been marked due to misspeculation. The callsite instruction
sequence looks like (assuming a misspeculated state value of `-1`):
```
  ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        shlq    $47, %rax
        orq     %rax, %rsp
        callq   other_function
        movq    %rsp, %rax
        sarq    $63, %rax               # Sign extend the high bit to all bits.
```

This first puts the predicate state into the high bits of `%rsp` before calling
the function and then reads it back out of the high bits of `%rsp` afterward.
When correctly executing (speculatively or not), these are all no-ops. When
misspeculating, the stack pointer will end up negative. We arrange for it to
remain a canonical address, but otherwise leave the low bits alone to allow
stack adjustments to proceed normally without disrupting this. Within the
called function, we can extract this predicate state and then reset it on
return:
```
other_function:
        # prolog
        callq   other_function
        movq    %rsp, %rax
        sarq    $63, %rax               # Sign extend the high bit to all bits.
        # ...

.LBB0_N:
        cmovneq %r8, %rax               # Conditionally update predicate state.
        shlq    $47, %rax
        orq     %rax, %rsp
        retq
```

This approach is effective when all code is mitigated in this fashion, and can
even survive very limited reaches into unmitigated code (the state will
round-trip in and back out of an unmitigated function, it just won't be
updated). But it does have some limitations. There is a cost to merging the
state into `%rsp` and it doesn't insulate mitigated code from misspeculation in
an unmitigated caller.

There is also an advantage to using this form of interprocedural mitigation: by
forming these invalid stack pointer addresses we can prevent speculative
returns from successfully reading values speculatively written to the actual
stack. This works first by forming a data-dependency between computing the
address of the return address on the stack and our predicate state. And even
when that dependency is satisfied, if a misprediction causes the state to be
poisoned, the resulting stack pointer will be invalid.


##### Rewrite API of internal functions to directly propagate predicate state

(Not yet implemented.)

We have the option with internal functions to directly adjust their API to
accept the predicate as an argument and return it. This is likely to be
marginally cheaper than embedding into `%rsp` for entering functions.
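
A sketch of what such a rewritten signature might look like, in the
C-pseudo-code style of the earlier examples (purely illustrative; nothing like
this is implemented, and the function name is invented for the example):
```
void leak(int data);

// Hypothetical shape of an internal function whose API directly threads the
// predicate state instead of smuggling it through `%rsp`.
uintptr_t internal_helper(int* pointer, uintptr_t predicate_state) {
  // Loads in the callee are hardened against the state it received...
  pointer = (int*)((uintptr_t)pointer | predicate_state);
  leak(*pointer);
  // ...and the state, updated along any conditional branches in the callee,
  // is returned so the caller can keep accumulating it.
  return predicate_state;
}
```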


##### Use `lfence` to guard function transitions

An `lfence` instruction can be used to prevent subsequent loads from
speculatively executing until all prior mispredicted predicates have resolved.
We can use this broader barrier to block speculative loads from executing
between functions. We emit it in the entry block to handle calls, and prior to
each return. This approach also has the advantage of providing the strongest
degree of mitigation when mixed with unmitigated code by halting all
misspeculation entering a function which is mitigated, regardless of what
occurred in the caller. However, such a mixture is inherently more risky.
Whether this kind of mixture is a sufficient mitigation requires careful
analysis.

Unfortunately, experimental results indicate that the performance overhead of
this approach is very high for certain patterns of code. A classic example is
any form of recursive evaluation engine. The hot, rapid call and return
sequences exhibit dramatic performance loss when mitigated with `lfence`. This
component alone can regress performance by 2x or more, making it an unpleasant
tradeoff even when only used in a mixture of code.


##### Use an internal TLS location to pass predicate state

We can define a special thread-local value to hold the predicate state between
functions. This avoids direct ABI implications by using a side channel between
callers and callees to communicate the predicate state. It also allows implicit
zero-initialization of the state, which allows non-checked code to be the first
code executed.

However, this requires a load from TLS in the entry block, a store to TLS
before every call and every ret, and a load from TLS after every call. As a
consequence it is expected to be substantially more expensive even than using
`%rsp` and potentially `lfence` within the function entry block.


##### Define a new ABI and/or calling convention

We could define a new ABI and/or calling convention to explicitly pass the
predicate state in and out of functions. This may be interesting if none of the
alternatives have adequate performance, but it makes deployment and adoption
dramatically more complex, and potentially infeasible.


## High-Level Alternative Mitigation Strategies

There are completely different alternative approaches to mitigating variant 1
attacks. [Most](https://lwn.net/Articles/743265/)
[discussion](https://lwn.net/Articles/744287/) so far focuses on mitigating
specific known attackable components in the Linux kernel (or other kernels) by
manually rewriting the code to contain an instruction sequence that is not
vulnerable. For x86 systems this is done by either injecting an `lfence`
instruction along the code path which would leak data if executed speculatively
or by rewriting memory accesses to have branch-less masking to a known safe
region. On Intel systems, `lfence` [will prevent the speculative load of secret
data](https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf).
On AMD systems `lfence` is currently a no-op, but can be made
dispatch-serializing by setting an MSR, and thus preclude misspeculation of the
code path ([mitigation G-2 +
V1-1](https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf)).

However, this relies on finding and enumerating all possible points in code
which could be attacked to leak information.
While in some cases static +analysis is effective at doing this at scale, in many cases it still relies on +human judgement to evaluate whether code might be vulnerable. Especially for +software systems which receive less detailed scrutiny but remain sensitive to +these attacks, this seems like an impractical security model. We need an +automatic and systematic mitigation strategy. + + +### Automatic `lfence` on Conditional Edges + +A natural way to scale up the existing hand-coded mitigations is simply to +inject an `lfence` instruction into both the target and fallthrough +destinations of every conditional branch. This ensures that no predicate or +bounds check can be bypassed speculatively. However, the performance overhead +of this approach is, simply put, catastrophic. Yet it remains the only truly +"secure by default" approach known prior to this effort and serves as the +baseline for performance. + +One attempt to address the performance overhead of this and make it more +realistic to deploy is [MSVC's /Qspectre +switch](https://blogs.msdn.microsoft.com/vcblog/2018/01/15/spectre-mitigations-in-msvc/). +Their technique is to use static analysis within the compiler to only insert +`lfence` instructions into conditional edges at risk of attack. However, +[initial](https://arstechnica.com/gadgets/2018/02/microsofts-compiler-level-spectre-fix-shows-how-hard-this-problem-will-be-to-solve/) +[analysis](https://www.paulkocher.com/doc/MicrosoftCompilerSpectreMitigation.html) +has shown that this approach is incomplete and only catches a small and limited +subset of attackable patterns which happen to resemble very closely the initial +proofs of concept. As such, while its performance is acceptable, it does not +appear to be an adequate systematic mitigation. + + +## Performance Overhead + +The performance overhead of this style of comprehensive mitigation is very +high. However, it compares very favorably with previously recommended +approaches such as the `lfence` instruction. Just as users can restrict the +scope of `lfence` to control its performance impact, this mitigation technique +could be restricted in scope as well. + +However, it is important to understand what it would cost to get a fully +mitigated baseline. Here we assume targeting a Haswell (or newer) processor and +using all of the tricks to improve performance (so leaves the low 2gb +unprotected and +/- 2gb surrounding any PC in the program). We ran both +Google's microbenchmark suite and a large highly-tuned server built using +ThinLTO and PGO. All were built with `-march=haswell` to give access to BMI2 +instructions, and benchmarks were run on large Haswell servers. We collected +data both with an `lfence`-based mitigation and load hardening as presented +here. The summary is that mitigating with load hardening is 1.77x faster than +mitigating with `lfence`, and the overhead of load hardening compared to a +normal program is likely between a 10% overhead and a 50% overhead with most +large applications seeing a 30% overhead or less. + +| Benchmark | `lfence` | Load Hardening | Mitigated Speedup | +| -------------------------------------- | -------: | -------------: | ----------------: | +| Google microbenchmark suite | -74.8% | -36.4% | **2.5x** | +| Large server QPS (using ThinLTO & PGO) | -62% | -29% | **1.8x** | + +Below is a visualization of the microbenchmark suite results which helps show +the distribution of results that is somewhat lost in the summary. 
The y-axis is +a log-scale speedup ratio of load hardening relative to `lfence` (up -> faster +-> better). Each box-and-whiskers represents one microbenchmark which may have +many different metrics measured. The red line marks the median, the box marks +the first and third quartiles, and the whiskers mark the min and max. + +![Microbenchmark result visualization](speculative_load_hardening_microbenchmarks.png) + +We don't yet have benchmark data on SPEC or the LLVM test suite, but we can +work on getting that. Still, the above should give a pretty clear +characterization of the performance, and specific benchmarks are unlikely to +reveal especially interesting properties. + + +### Future Work: Fine Grained Control and API-Integration + +The performance overhead of this technique is likely to be very significant and +something users wish to control or reduce. There are interesting options here +that impact the implementation strategy used. + +One particularly appealing option is to allow both opt-in and opt-out of this +mitigation at reasonably fine granularity such as on a per-function basis, +including intelligent handling of inlining decisions -- protected code can be +prevented from inlining into unprotected code, and unprotected code will become +protected when inlined into protected code. For systems where only a limited +set of code is reachable by externally controlled inputs, it may be possible to +limit the scope of mitigation through such mechanisms without compromising the +application's overall security. The performance impact may also be focused in a +few key functions that can be hand-mitigated in ways that have lower +performance overhead while the remainder of the application receives automatic +protection. + +For both limiting the scope of mitigation or manually mitigating hot functions, +there needs to be some support for mixing mitigated and unmitigated code +without completely defeating the mitigation. For the first use case, it would +be particularly desirable that mitigated code remains safe when being called +during misspeculation from unmitigated code. + +For the second use case, it may be important to connect the automatic +mitigation technique to explicit mitigation APIs such as what is described in +http://wg21.link/p0928 (or any other eventual API) so that there is a clean way +to switch from automatic to manual mitigation without immediately exposing a +hole. However, the design for how to do this is hard to come up with until the +APIs are better established. We will revisit this as those APIs mature.
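
As a purely hypothetical illustration of the kind of per-function opt-in and
opt-out discussed above (no such attribute spelling is implemented today; the
names below are invented for this sketch):
```
// Hypothetical attribute spellings -- not an implemented interface.
__attribute__((speculative_load_hardening))
int lookup(const int* table, unsigned long size, unsigned long untrusted_index);

// A hand-mitigated hot function that opts out of the automatic hardening.
__attribute__((no_speculative_load_hardening))
int hot_inner_loop(const int* data, unsigned long size);
```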