Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", and is one of the two halves to Spectre..
Accepted · Public

Authored by chandlerc on Thu, Jan 4, 1:14 AM.

Details

Summary

First, we need to explain the core of the vulnerability. Note that this
is a very incomplete description; please see the Project Zero blog post
for details:
https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html

The basis for branch target injection is to direct speculative execution
of the processor to some "gadget" of executable code by poisoning the
prediction of indirect branches with the address of that gadget. The
gadget in turn contains an operation that provides a side channel for
reading data. Most commonly, this will look like a load of secret data
followed by a branch on the loaded value and then a load of some
predictable cache line. The attacker then uses timing of the processor's
cache to determine which direction the branch took *in the speculative
execution*, and in turn what one bit of the loaded value was. Due to the
nature of these timing side channels and the branch predictor on Intel
processors, this allows an attacker to leak data only accessible to
a privileged domain (like the kernel) back into an unprivileged domain.
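To make the gadget shape above concrete, here is a hypothetical sketch in x86 assembly (illustrative only; the probe_line symbol and exact instructions are invented, not taken from any real exploit). Speculative execution of this code leaks one bit of the secret through whether probe_line ends up in the cache:

  movzbl (%rdi), %eax              # speculative load of a secret byte
  testb  $1, %al                   # branch on one bit of the loaded value
  je     skip
  movq   probe_line(%rip), %rcx    # touch a predictable, attacker-timeable cache line
skip: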

The goal is simple: avoid generating code which contains an indirect
branch that could have its prediction poisoned by an attacker. In many
cases, the compiler can simply use directed conditional branches and
a small search tree. LLVM already has support for lowering switches in
this way and the first step of this patch is to disable jump-table
lowering of switches and introduce a pass to rewrite explicit indirectbr
sequences into a switch over integers.
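As an illustration of the difference (a hand-written sketch, not actual output of this patch), a dense switch over the values 0, 1, and 2 might be lowered as follows; the commented-out jump-table form contains the poisonable indirect branch, while the conditional-branch tree below it does not:

  # Jump-table lowering to avoid:
  #   movslq %edi, %rax
  #   jmpq   *switch_table(,%rax,8)
  # Conditional-branch tree used instead:
  cmpl   $1, %edi
  jl     case0
  je     case1
  jmp    case2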

However, there is no fully general alternative to indirect calls. We
introduce a new construct we call a "retpoline" to implement indirect
calls in a non-speculatable way. It can be thought of loosely as
a trampoline for indirect calls which uses the RET instruction on x86.
Further, we arrange for a specific call->ret sequence which ensures the
processor predicts the return to go to a controlled, known location. The
retpoline then "smashes" the return address pushed onto the stack by the
call with the desired target of the original indirect call. The result
is a predicted return to the next instruction after a call (which can be
used to trap speculative execution within an infinite loop) and an
actual indirect branch to an arbitrary address.

On 64-bit x86 ABIs, this is especially easy to do in the compiler by
using a guaranteed scratch register to pass the target into this device.
For 32-bit ABIs there isn't a guaranteed scratch register and so several
different retpoline variants are introduced to use a scratch register if
one is available in the calling convention and to otherwise use direct
stack push/pop sequences to pass the target address.
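As a rough sketch of the construct described in the two paragraphs above (illustrative only; the label names here are invented, and the exact thunk names and instruction sequences are defined by the patch itself), the 64-bit register-based variant, with the call target already in the r11 scratch register, has this shape:

retpoline_r11_sketch:               # hypothetical name
  callq  set_target                 # predicted return goes to capture_spec
capture_spec:
  pause                             # speculative execution of the ret is trapped here
  jmp    capture_spec               # (the exact filler loop is refined later in this review)
set_target:
  movq   %r11, (%rsp)               # smash the pushed return address with the real target
  retq                              # architecturally branches to the original indirect target

The 32-bit push/pop variants follow the same call->ret shape but shuffle the target through the stack rather than a scratch register.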

This "retpoline" mitigation is fully described in the following blog
post: https://support.google.com/faqs/answer/7625886

There is one other important source of indirect branches in x86 ELF
binaries: the PLT. These patches also include support for LLD to
generate PLT entries that perform a retpoline-style indirection.
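Roughly, a retpoline-style PLT entry loads the resolved target from the GOT into a scratch register and dispatches through the same call->ret construct instead of a plain indirect jmp. A sketch for a function foo (illustrative only; not LLD's exact byte-for-byte layout, which shares the trap/return sequence in the PLT header):

  movq   foo@GOTPCREL(%rip), %r11   # load the resolved target from the GOT
  callq  plt_set_target
plt_spec:
  pause
  jmp    plt_spec
plt_set_target:
  movq   %r11, (%rsp)
  retq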

The only other indirect branches remaining that we are aware of are from
precompiled runtimes (such as crt0.o and similar). The ones we have
found are not really attackable, and so we have not focused on them
here, but eventually these runtimes should also be replicated for
retpoline-ed configurations for completeness.

For kernels or other freestanding or fully static executables, the
compiler switch -mretpoline is sufficient to fully mitigate this
particular attack. For dynamic executables, you must compile *all*
libraries with -mretpoline and additionally link the dynamic
executable and all shared libraries with LLD and pass -z retpolineplt
(or use similar functionality from some other linker). We strongly
recommend also using -z now as non-lazy binding allows the
retpoline-mitigated PLT to be substantially smaller.

When manually applying transformations similar to -mretpoline to the
Linux kernel, we observed very small performance hits to applications
running typical workloads, and relatively minor hits (approximately 2%)
even for extremely syscall-heavy applications. This is largely due to
the small number of indirect branches that occur in performance
sensitive paths of the kernel.

When using these patches on statically linked applications, especially
C++ applications, you should expect to see a much more dramatic
performance hit. For microbenchmarks that are switch-, indirect-call-, or
virtual-call-heavy, we have seen overheads ranging from 10% to 50%.

However, real-world workloads exhibit substantially lower performance
impact. Notably, techniques such as PGO and ThinLTO dramatically reduce
the impact of hot indirect calls (by speculatively promoting them to
direct calls) and allow optimized search trees to be used to lower
switches. If you need to deploy these techniques in C++ applications, we
*strongly* recommend that you ensure all hot call targets are statically
linked (avoiding PLT indirection) and use both PGO and ThinLTO. Well-tuned
servers using all of these techniques saw 5% - 10% overhead from
the use of retpoline.

We will add detailed documentation covering these components in
subsequent patches, but wanted to make the core functionality available
as soon as possible. Happy for more code review, but we'd really like to
get these patches landed and backported ASAP for obvious reasons. We're
planning to backport this to both 6.0 and 5.0 release streams and get
a 5.0 release with just this cherry picked ASAP for distros and vendors.

This patch is the work of a number of people over the past month: Eric, Reid,
Rui, and myself. I'm mailing it out as a single commit due to the time
sensitive nature of landing this and the need to backport it. Huge thanks to
everyone who helped out here, and everyone at Intel who helped out in
discussions about how to craft this. Also, credit goes to Paul Turner (at
Google, but not an LLVM contributor) for much of the underlying retpoline
design.

Diff Detail

llvm/lib/Target/X86/X86InstrCompiler.td
1160

Hi Chandler,
Do you really want to use "NotUseRetpoline" here? It will then match RETPOLINE_TCRETURN32 even when it is not an indirect branch.
I guess the following test case will crash llc.

target triple = "i386-unknown-linux-gnu"

define void @FOO() {
  ret void
}

define void @FOO1() {
entry:
  tail call void @FOO()
  ret void
}
llvm/lib/Target/X86/X86RetpolineThunks.cpp
122

You will create a thunk function even if it is not necessary? For example for "tail call void @FOO()"?

chandlerc marked an inline comment as done. Thu, Jan 4, 2:16 PM

That should be easy to support -- just don't do the thunk-emission -- but it does then mean the need to standardize on the names and semantics of the required thunks.

I don't think this is true. We need to document LLVM's semantics and the thunks required. If they match GCC's, cool. If not, and a user provides a flag to request an external thunk, then they have to give LLVM one that matches LLVM's semantics. Since they control the name, they can even have thunks with each compiler's semantics and different names. While in practice I expect the semantics to match, I don't think we should be in the business of forcing this into the ABI. We already have waaaay too much in the ABI. If we come up with a better way to do retpoline in the future, we should rapidly switch to that without needing to coordinate about names and semantics across compilers.

Right, I shouldn't have said "standardize", I did actually mean "document".

And it would be good if the same function names as GCC were used. Fine to do as a follow-up patch, but maybe at least have an idea what the command-line UI will be for it.

I strongly disagree here. Again, this should *not* be part of the ABI and it should not be something that we have to engineer to match exactly with other compilers.

No, I agree with you, it does not _need_ to match. However, if the semantics are the same as GCC's thunks (which they do appear to be), it would be _good_ to coordinate with each other, simply in order to make users' lives easier. Which is the same reason to coordinate the command-line UI. Neither is required, but both would be good.

We should even be free to add __llvm_retpoline2_foo() thunks in the future with new semantics with zero compatibility concerns.

It wouldn't have _zero_ concerns due to externally-provided thunks -- they'll need to provide new ones to use the new compiler version. But probably that's okay.

Sure, sounds like we're generally in agreement here. I'll take a stab at supporting external thunks and skipping emission; shouldn't be too hard.

lld/ELF/Arch/X86.cpp
491

Yeah, if you can email me a delta on top of this, I'll fold it in. Thanks!

(Suggesting that in part to make backports a bit easier...)

lld/ELF/Arch/X86_64.cpp
548–556

Will this break the 16-byte property when combined with -z now? That seems more relevant. Given that the lazy binding is intended to only run once, I don't think it's worth burning much space to speed it up.

llvm/lib/CodeGen/IndirectBrExpandPass.cpp
114

I just didn't want to hard code that assumption, but I can if you prefer.

135

I mean, yes, but also no. ;]

It would be nice to maybe preserve inline asm uses of blockaddr and not any others, and then force people to not rinse their blockaddr usage through inline asm and mix that with -mretpoline. That would allow the common usage I'm aware of to remain (showing label addresses in crash dumps in things like kernels) and really allow almost any usage wholly contained within inline asm to continue working perfectly.

But it seemed reasonable for a follow-up. That said, maybe it's not too complex to add now...

llvm/lib/Target/X86/X86InstrCompiler.td
1160

Reid, could you take a look?

llvm/lib/Target/X86/X86RetpolineThunks.cpp
122

There is a FIXME above about being more careful when creating these.

sanjoy added inline comments. Thu, Jan 4, 2:38 PM
llvm/lib/CodeGen/IndirectBrExpandPass.cpp
135

What do you think about report_fatal_erroring here if you encounter an inline assembly user? That seems somewhat more friendly than silently "miscompiling" (in quotes) inline assembly.

rnk added inline comments. Thu, Jan 4, 2:50 PM
llvm/lib/Target/X86/X86InstrCompiler.td
1160

Here's a small diff on top of this one that fixes and tests it: https://reviews.llvm.org/P8056

efriedma added inline comments. Thu, Jan 4, 3:04 PM
llvm/lib/CodeGen/IndirectBrExpandPass.cpp
114

If we violate that assumption, something has gone very wrong (either we've created a blockaddress in the wrong context, or we leaked a blockaddress from the context, or we have a blockaddress with an invalid block+function pair).

Although, on a related note, you might want to check Constant::isConstantUsed(), so we don't generate indexes for blockaddresses which aren't actually referenced anywhere.

dexonsmith added inline comments. Thu, Jan 4, 3:33 PM
llvm/lib/CodeGen/IndirectBrExpandPass.cpp
114

FWIW, I agree with Eli that it's fundamentally bad if constants haven't been uniqued properly.

davidxl added a subscriber: davidxl. Thu, Jan 4, 4:02 PM
davidxl added inline comments.
clang/lib/Basic/Targets/X86.cpp
1333

This can go either way.

Fundamentally, the option is used to 'disable' the indirect branch predictor, so it sounds target independent. However, the mechanism used here, 'retpoline', is very x86 specific -- it forces the HW to use the call/ret predictor so that the predicted target is 'trapped' in a loop.

chandlerc updated this revision to Diff 128675. Thu, Jan 4, 4:22 PM
chandlerc marked 3 inline comments as done.

Fix several bugs caught in code review, as well as implementing suggestions.

I'm looking at adding support for external thunks next.

chandlerc updated this revision to Diff 128678. Thu, Jan 4, 4:30 PM
chandlerc marked 6 inline comments as done.

Add another minor suggestion that I forgot previously.

ruiu added a comment. Thu, Jan 4, 4:31 PM

Chandler,

Please apply https://reviews.llvm.org/D41744 to this patch. It includes the following changes:

  1. xchg is replaced with mov/pop instructions
  2. x86-64 lazy PLT relocation target is now aligned to 16 bytes
  3. the x86-64 PLT header for lazy PLT resolution is shrunk from 48 bytes to 32 bytes (which became possible by utilizing the space made by (2))
chandlerc added inline comments. Thu, Jan 4, 4:32 PM
llvm/lib/CodeGen/IndirectBrExpandPass.cpp
114

Yeah, I'm convinced. It actually makes the code *way* simpler. I've implemented all of the suggestions here.

135

I actually think that'd be much less friendly.

The primary use case I'm aware of is for diagnostics, and I'd much rather have low-quality diagnostics in some random subsystem than be unable to use retpolines...

I've added a FIXME about this but I'm pretty nervous about breaking round-trip-through-inline-asm to repair a quality loss in diagnostics. But we can revisit this if needed.

llvm/lib/Target/X86/X86InstrCompiler.td
1160

Awesome, applied!

ab accepted this revision. Thu, Jan 4, 5:00 PM
ab added a subscriber: ab.

Clang/LLVM pieces LGTM. Thanks for doing this, people.

llvm/lib/CodeGen/IndirectBrExpandPass.cpp
57

Nit: Unnecessary 'private'?

llvm/lib/Target/X86/X86ISelLowering.cpp
27053

I wonder: should this also check the CSR regmask? There are a couple CCs that do preserve R11, but it probably doesn't matter for -mretpoline users. E.g.,

define void @t(void ()* %f) {
  call cc 17 void %f()
  ret void
}

Worth an assert?

llvm/lib/Target/X86/X86RetpolineThunks.cpp
120

'retq' at the end

172

Nit: newlines between functions

184

'noinline' makes sense, but I'm curious: is it necessary?

llvm/lib/Target/X86/X86Subtarget.h
348

Nit: "Use retpoline to avoid ..."

MaskRay added a subscriber: MaskRay. Thu, Jan 4, 5:58 PM
chandlerc updated this revision to Diff 128698. Thu, Jan 4, 6:06 PM

Add support for externally provided thunks. This is an independent feature;
when combined with the overall retpoline feature it suppresses the thunk
emission and rotates the names to be distinct names that an external build
system for the kernel (for example) can provide.

I've added some minimal documentation about the semantic requirements of these
thunks to the commit log, although it is fairly obvious. More comprehensive
documentation will be part of the large follow-up effort around docs.

Also adds Rui's work to provide more efficient PLT thunks.

Also addresses remaining feedback from Ahmed.

In D41723#967861, @ruiu wrote:

Chandler,

Please apply https://reviews.llvm.org/D41744 to this patch. It includes the following changes:

  1. xchg is replaced with mov/pop instructions
  2. x86-64 lazy PLT relocation target is now aligned to 16 bytes
  3. the x86-64 PLT header for lazy PLT resolution is shrunk from 48 bytes to 32 bytes (which became possible by utilizing the space made by (2))

Done, and awesome!

Reid, do you want me to adjust the code for the 32-bit _push thunk?

chandlerc updated this revision to Diff 128700. Thu, Jan 4, 6:12 PM
chandlerc marked 16 inline comments as done.

Add the new test file for external thunks.

chandlerc marked an inline comment as done. Thu, Jan 4, 6:13 PM

All comments addressed now I think, thanks everyone! See some thoughts about a few of these below though.

llvm/lib/Target/X86/X86ISelLowering.cpp
27053

We should already have the assert below? If we get through this and have no available reg, we assert that we're in 32-bit.

That said, this really is a case that will come up. We shouldn't silently miscompile the code. I've changed both cases to report_fatal_error because these are fundamental incompatibilities with retpoline.

llvm/lib/Target/X86/X86RetpolineThunks.cpp
184

Nope, not necessary. I've removed it.

Actually, I don't think any of these are necessary because we hand build the MI and we are the last pass before the asm printer. But I've left the others for now. They seem at worst harmless.

I had one comment I think got missed: there are some references to X86::CALL64r/X86::CALL64m in X86FrameLowering.cpp and X86MCInstLower.cpp which look like they could be relevant, but aren't addressed by this patch.

ruiu added a comment. Thu, Jan 4, 9:43 PM

Chandler, please apply https://reviews.llvm.org/P8058 to address latest review comments.

eax added a subscriber: eax. Fri, Jan 5, 12:30 AM
chandlerc updated this revision to Diff 128715. Fri, Jan 5, 1:51 AM
chandlerc marked an inline comment as done.

Rebase and fold in the latest batch of changes from Rui based on the review
from Rafael.

royger added a comment. Fri, Jan 5, 1:58 AM

Add support for externally provided thunks. This is an independent feature;
when combined with the overall retpoline feature it suppresses the thunk
emission and rotates the names to be distinct names that an external build
system for the kernel (for example) can provide.

I've added some minimal documentation about the semantic requirements of these
thunks to the commit log, although it is fairly obvious. More comprehensive
documentation will be part of the large follow-up effort around docs.

Thanks! I'm however not seeing the updated commit message that contains the usage documentation of the new option in the differential revision.

AFAICT from the code, the new option is going to be named "mretpoline_external_thunk"; could we have a more generic name that could be used by all options, like:

-mindirect-thunk={retpoline,external}

This should also allow clang to implement new techniques as they become available, without having to add new options for each one of them.

Add support for externally provided thunks. This is an independent feature;
when combined with the overall retpoline feature it suppresses the thunk
emission and rotates the names to be distinct names that an external build
system for the kernel (for example) can provide.

I've added some minimal documentation about the semantic requirements of these
thunks to the commit log, although it is fairly obvious. More comprehensive
documentation will be part of the large follow-up effort around docs.

Thanks! I'm however not seeing the updated commit message that contains the usage documentation of the new option in the differential revision.

Yeah, not sure why phab doesn't update that -- it will be updated when it lands.

AFAICT from the code, the new option is going to be named "mretpoline_external_thunk"; could we have a more generic name that could be used by all options, like:

-mindirect-thunk={retpoline,external}

This should also allow clang to implement new techniques as they become available, without having to add new options for each one of them.

Adding options is actually cheaper than adding option parsing like this...

I'm hesitant to name the option in the external case something completely generic, because it isn't as generic as it seems IMO. We hide indirect calls and indirect branches, but we *don't* hide returns which are technically still indirects... The reason is that we are assuming that rewriting arbitrary indirects into a return is an effective mitigation. If it weren't, we'd also need to rewrite returns. This is a large component of the discussion in the linked detailed article around RSB filling -- we have to use some *other* mechanisms to defend ret, and then we rely on them being "safe".

As a consequence of all of this, this flag to clang shouldn't be used for *arbitrary* mitigations. It really should be used for mitigations stemming from return-based trampolining of indirect branches.

If we want truly generic thunk support in the future, we should add that under a separate flag name, and it should likely apply to returns as well as other forms of indirects. I know there has been some discussion about using lfence to protect indirect branch and call on AMD processors, but I'm still extremely dubious about this being an advisable mitigation. I have *significantly* more confidence in the retpoline technique being a truly effective mitigation against the attacks, and the performance we observe in the wild really is completely acceptable (for the Linux kernel, hard to even measure outside of contrived micro-benchmarks).

chandlerc updated this revision to Diff 128721. Fri, Jan 5, 2:46 AM

Add explicit checks for various lowerings that would need direct retpoline
support that is not yet implemented. These are constructs that don't show up
in typical C, C++, and other static languages today: statepoints, patchpoints,
stack probing, and morestack.

I had one comment I think got missed: there are some references to X86::CALL64r/X86::CALL64m in X86FrameLowering.cpp and X86MCInstLower.cpp which look like they could be relevant, but aren't addressed by this patch.

Sorry for failing to respond previously...

I've audited all of these. They should have already triggered a hard failure if we made it through ISel though. I've added explicit and more user-friendly fatal errors if we hit these cases.

The only really interesting cases are statepoints and patchpoints. The other cases are only relevant when in the large code model on x86-64 which is pretty much riddled with bugs already. =/

For AMD processors we may be able to handle indirect jumps via a simpler lfence mechanism. Indirect calls may still require retpoline. If this turns out to be the right solution for AMD processors we may need to put some code in to support this.

Thanks! I'm however not seeing the updated commit message that contains the usage documentation of the new option in the differential revision.

Yeah, not sure why phab doesn't update that -- it will be updated when it lands.

Phabricator keeps the summary field of a revision independent of the commit message (beyond the initial upload). You can use arc diff --verbatim to force the summary to be re-copied from the commit message. However, running that will also update the Reviewers and Subscribers fields with their values in the commit message, which would probably drop a lot of subscribers in this case, so I wouldn't recommend it. You can always Edit Revision and update the summary manually if you really want.

The only really interesting cases are statepoints and patchpoints. The other cases are only relevant when in the large code model on x86-64 which is pretty much riddled with bugs already. =/

To be explicit, you can ignore statepoints and patchpoints for the moment. As the only major user of this functionality, we'll follow up with patches if needed. We're still in the process of assessing actual risk in our environment, but at the moment it looks like we likely won't need this functionality. (Obviously, subject to change as we learn more.)

Any more comments? I'd love to land this and start the backporting and merging process. =D

FWIW, I've built and linked the test suite with this in various modes, both 64-bit and 32-bit, and have no functional failures. I've not done any specific performance measurements using the LLVM test suite, but you can see our initial (very rough) performance data in the OP.

For AMD processors we may be able to handle indirect jumps via a simpler lfence mechanism. Indirect calls may still require retpoline. If this turns out to be the right solution for AMD processors we may need to put some code in to support this.

Yeah, if it ends up that we want non-retpoline mitigations for AMD we can and should add them. One hope I have is that this patch is at least generically *sufficient* (when paired with correct RSB filling) even if it is suboptimal in some cases and we end up adding more precise tools later.

(Er sorry, have two updates from Rafael I need to actually include, will be done momentarily...)

chandlerc updated this revision to Diff 128834. Fri, Jan 5, 5:45 PM

Teach the thunk emission to put them in comdats and enhance tests to verify
this.

Also add test coverage for nonlazybind calls, which on 64-bit architectures
require a retpoline despite there being no user-written indirect call. This already
worked, but Rafael rightly pointed out we should test it.

For AMD processors we may be able to handle indirect jumps via a simpler lfence mechanism. Indirect calls may still require retpoline. If this turns out to be the right solution for AMD processors we may need to put some code in to support this.

Yeah, if it ends up that we want non-retpoline mitigations for AMD we can and should add them. One hope I have is that this patch is at least generically *sufficient* (when paired with correct RSB filling) even if it is suboptimal in some cases and we end up adding more precise tools later.

Just to say that at Sony we're still doing our investigation and might be interested in lfence. But who knows, we might just zap the predictor on syscalls and context switches; for environments that have mostly a few long-running processes with comparatively few syscalls it might be net cheaper than making every indirection more expensive.

For AMD processors we may be able to handle indirect jumps via a simpler lfence mechanism. Indirect calls may still require retpoline. If this turns out to be the right solution for AMD processors we may need to put some code in to support this.

Yeah, if it ends up that we want non-retpoline mitigations for AMD we can and should add them. One hope I have is that this patch is at least generically *sufficient* (when paired with correct RSB filling) even if it is suboptimal in some cases and we end up adding more precise tools later.

Just to say that at Sony we're still doing our investigation and might be interested in lfence. But who knows, we might just zap the predictor on syscalls and context switches; for environments that have mostly a few long-running processes with comparatively few syscalls it might be net cheaper than making every indirection more expensive.

But retpoline doesn't make every indirection more expensive any more or less than zapping the predictor... You only build the code running in the privileged domain with retpoline, not all of the code, and they both accomplish very similar things.

The performance difference we see between something like retpoline and disabling the predictor on context switches is very significant (retpoline is much, much cheaper).

A good way to think about the cost of these things is this. The cost of retpoline we have observed on the kernel:

  1. the cost of executing the system call with "broken" indirect branch prediction (i.e., reliably mispredicted), plus
  2. the cost of few extra instructions (very, very few cycles)

Both of these are very effectively mitigated by efforts to remove hot indirect branches from the system call code in the kernel. Because of the nature of most kernels, this tends to be pretty easy and desirable for performance anyways.

By comparison, the cost of toggling off the predictor is:

  1. the exact same cost as #1 above, plus
  2. the cost of toggling the MSR on every context switch

This second cost, very notably, cannot be meaningfully mitigated by PGO, or hand-implemented hot-path specializations without an indirect branch. And our measurements on Intel hardware at least show that this cost of toggling is actually the dominant cost by a very large margin.

So, you should absolutely measure the impact of the AMD solutions you have on your AMD hardware as it may be very significantly different. But I wanted to set the expectation correctly based on what limited experience we have so far (sadly only on Intel hardware).

But retpoline doesn't make every indirection more expensive any more or less than zapping the predictor... You only build the code running in the privileged domain with retpoline, not all of the code, and they both accomplish very similar things.

I do understand that it applies only to privileged code.

The performance difference we see between something like retpoline and disabling the predictor on context switches is very significant (retpoline is much, much cheaper).

I expect you are measuring this in a normal timesharing environment, which is not what we have (more below).

A good way to think about the cost of these things is this. The cost of retpoline we have observed on the kernel:

  1. the cost of executing the system call with "broken" indirect branch prediction (i.e., reliably mispredicted), plus
  2. the cost of few extra instructions (very, very few cycles)

    Both of these are very effectively mitigated by efforts to remove hot indirect branches from the system call code in the kernel. Because of the nature of most kernels, this tends to be pretty easy and desirable for performance anyways.

Right.

By comparison, the cost of toggling off the predictor is:

  1. the exact same cost as #1 above, plus
  2. the cost of toggling the MSR on every context switch

    This second cost, very notably, cannot be meaningfully mitigated by PGO, or hand-implemented hot-path specializations without an indirect branch. And our measurements on Intel hardware at least show that this cost of toggling is actually the dominant cost by a very large margin.

As I said above, I expect you are measuring this on a normal timesharing system. A game console is not a normal timesharing environment. We reserve a core to the system and the game gets the rest of them. My understanding is that the game is one process and it essentially doesn't get process-context-switched (this understanding could be very flawed) although it probably does get thread-context-switched (which should still be cheap as there is no protection-domain transition involved). If there basically aren't any process context switches while running a game (which is when performance matters most) then moving even a small in-process execution cost to a larger context-switch cost is a good thing, not a bad thing. We have hard-real-time constraints to worry about.

So, you should absolutely measure the impact of the AMD solutions you have on your AMD hardware as it may be very significantly different. But I wanted to set the expectation correctly based on what limited experience we have so far (sadly only on Intel hardware).

I appreciate all the work you've done on this, and your sharing of your findings. Unfortunately on our side we are playing catch-up as we were unaware of the problem before the public announcements and the publication of this patch. We're definitely doing research and our own measurements to understand what is our best way forward.

(If it wasn't obvious, I'm not standing in the way of the patch, just noting the AMD thing which clearly can be a follow-up if it seems to be a better choice there.)

MatzeB added a subscriber: MatzeB. Mon, Jan 8, 1:20 PM
MatzeB added inline comments.
llvm/lib/Target/X86/X86RetpolineThunks.cpp
83–84

We need this pass for "correctness" and should never skip it I think.

lattera added a subscriber: lattera. Tue, Jan 9, 6:23 AM

FYI: I've imported this patch into a feature branch in HardenedBSD's playground repo. All of HardenedBSD (world + kernel) is compiled with it. I've been running it on my laptop for the past couple days without a single issue. I've enabled it locally on my box for a few applications, notably Firefox and ioquake3. Both work without issue. I plan to start an experimental package build tomorrow with it applied to the entire ports tree (around 29,400 packages).

chandlerc updated this revision to Diff 129849. Mon, Jan 15, 5:50 AM

Re-implement the indirectbr removal pass based on Rafael's suggestion. It now
creates a single switch block and threads all the indirectbrs in the function
through that block. This should give substantially smaller code, especially in
the case of using retpoline, where the switch block expands to a search tree
that we probably don't want to duplicate 100s of times.

chandlerc updated this revision to Diff 129850. Mon, Jan 15, 5:55 AM
chandlerc marked an inline comment as done.

Couple of other minor fixes.

Ok, I think this is all done. Rafael, I think I've implemented your suggestion as well and it still passes all my tests (including the test-suite) and a bunch of internal code I have.

Any last comments?

llvm/lib/Target/X86/X86RetpolineThunks.cpp
83–84

Agreed and done.

jyknight added a comment (edited). Mon, Jan 15, 6:45 PM

Per kernel https://marc.info/?l=linux-kernel&m=151580566622935&w=2 and gcc https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01059.html, it seems AMD needs there to be an lfence in the speculation trap (and the pause is not useful for them, but does no harm). There seems to be some speculation (but no confirmation yet?) that pause *is* necessary vs lfence on Intel. So in order to work generically, they seem to be suggesting using both instructions:

loop:
  pause
  lfence
  jmp loop

Some more links
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01209.html
and final patch:
https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b


Per kernel https://marc.info/?l=linux-kernel&m=151580566622935&w=2 and gcc https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01059.html, it seems AMD needs there to be an lfence in the speculation trap (and the pause is not useful for them, but does no harm). There seems to be some speculation (but no confirmation yet?) that pause *is* necessary vs lfence on Intel. So in order to work generically, they seem to be suggesting using both instructions:

loop:
  pause
  lfence
  jmp loop

Some more links
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01209.html
and final patch:
https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b

Yes, for AMD we require an "lfence" instruction after the "pause" in the "retpoline" loop filler. This solution has already been accepted in GCC and the Linux kernel.
Can you please do the same in LLVM as well?