Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", which is one of the two halves of Spectre.
Closed, Public

Authored by chandlerc on Jan 4 2018, 1:14 AM.

Details

Summary

First, we need to explain the core of the vulnerability. Note that this
is a very incomplete description, please see the Project Zero blog post
for details:
https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html

The basis for branch target injection is to direct speculative execution
of the processor to some "gadget" of executable code by poisoning the
prediction of indirect branches with the address of that gadget. The
gadget in turn contains an operation that provides a side channel for
reading data. Most commonly, this will look like a load of secret data
followed by a branch on the loaded value and then a load of some
predictable cache line. The attacker then uses timing of the processor's
cache to determine which direction the branch took *in the speculative
execution*, and in turn what one bit of the loaded value was. Due to the
nature of these timing side channels and the branch predictor on Intel
processors, this allows an attacker to leak data only accessible to
a privileged domain (like the kernel) back into an unprivileged domain.

The goal is simple: avoid generating code which contains an indirect
branch that could have its prediction poisoned by an attacker. In many
cases, the compiler can simply use directed conditional branches and
a small search tree. LLVM already has support for lowering switches in
this way and the first step of this patch is to disable jump-table
lowering of switches and introduce a pass to rewrite explicit indirectbr
sequences into a switch over integers.
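
As a sketch of the rewrite (the block and value names here are illustrative, not the pass's actual output), an indirectbr over blockaddress values becomes a switch over small integers:

```llvm
; Before: an indirect branch over blockaddress values, which lowers
; to a poisonable indirect jump.
define void @f(i8* %dest) {
entry:
  indirectbr i8* %dest, [label %bb1, label %bb2]
bb1:
  ret void
bb2:
  ret void
}

; After (conceptually): each blockaddress is replaced by a small
; integer, and a single switch dispatches to the successor blocks
; using conditional branches only.
define void @f.expanded(i32 %dest.index) {
entry:
  switch i32 %dest.index, label %bb1 [
    i32 1, label %bb2
  ]
bb1:
  ret void
bb2:
  ret void
}
```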

However, there is no fully general alternative to indirect calls. We
introduce a new construct we call a "retpoline" to implement indirect
calls in a non-speculatable way. It can be thought of loosely as
a trampoline for indirect calls which uses the RET instruction on x86.
Further, we arrange for a specific call->ret sequence which ensures the
processor predicts the return to go to a controlled, known location. The
retpoline then "smashes" the return address pushed onto the stack by the
call with the desired target of the original indirect call. The result
is a predicted return to the next instruction after a call (which can be
used to trap speculative execution within an infinite loop) and an
actual indirect branch to an arbitrary address.
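
Concretely, the x86-64 thunk looks roughly like the following sketch (shown here as the r11-based variant with illustrative labels; the lfence in the capture loop was added later in this review for AMD processors):

```asm
__llvm_retpoline_r11:
        call    .Lset_up_target    # push return address; CPU predicts the
                                   # return will come back here
.Lcapture_spec:                    # speculative execution is trapped in
        pause                      # this loop instead of running a gadget
        lfence
        jmp     .Lcapture_spec
.Lset_up_target:
        movq    %r11, (%rsp)       # smash the pushed return address with
                                   # the real target of the indirect call
        ret                        # actual control transfer goes to *%r11
```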

On 64-bit x86 ABIs, this is especially easy to do in the compiler by
using a guaranteed scratch register to pass the target into this device.
For 32-bit ABIs there isn't a guaranteed scratch register and so several
different retpoline variants are introduced to use a scratch register if
one is available in the calling convention and to otherwise use direct
stack push/pop sequences to pass the target address.

This "retpoline" mitigation is fully described in the following blog
post: https://support.google.com/faqs/answer/7625886

There is one other important source of indirect branches in x86 ELF
binaries: the PLT. These patches also include support for LLD to
generate PLT entries that perform a retpoline-style indirection.

The only other indirect branches remaining that we are aware of are from
precompiled runtimes (such as crt0.o). The ones we have
found are not really attackable, and so we have not focused on them
here, but eventually these runtimes should also be replicated for
retpoline-ed configurations for completeness.

For kernels or other freestanding or fully static executables, the
compiler switch -mretpoline is sufficient to fully mitigate this
particular attack. For dynamic executables, you must compile *all*
libraries with -mretpoline and additionally link the dynamic
executable and all shared libraries with LLD and pass -z retpolineplt
(or use similar functionality from some other linker). We strongly
recommend also using -z now as non-lazy binding allows the
retpoline-mitigated PLT to be substantially smaller.
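
Putting these flags together for a dynamically linked program (file names and paths here are illustrative), a fully mitigated build looks roughly like:

```shell
# Compile every object -- including those of all shared libraries the
# executable loads -- with the mitigation enabled.
clang -mretpoline -c app.c -o app.o

# Link with LLD, emitting retpoline-style PLT entries, and force
# non-lazy binding so the mitigated PLT can be substantially smaller.
clang -fuse-ld=lld -Wl,-z,retpolineplt -Wl,-z,now app.o -o app
```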

When manually applying transformations similar to -mretpoline to the
Linux kernel, we observed very small performance hits to applications
running typical workloads, and relatively minor hits (approximately 2%)
even for extremely syscall-heavy applications. This is largely due to
the small number of indirect branches that occur in performance
sensitive paths of the kernel.

When using these patches on statically linked applications, especially
C++ applications, you should expect to see a much more dramatic
performance hit. For microbenchmarks that are switch-, indirect-call-, or
virtual-call-heavy we have seen overheads ranging from 10% to 50%.

However, real-world workloads exhibit substantially lower performance
impact. Notably, techniques such as PGO and ThinLTO dramatically reduce
the impact of hot indirect calls (by speculatively promoting them to
direct calls) and allow optimized search trees to be used to lower
switches. If you need to deploy these techniques in C++ applications, we
*strongly* recommend that you ensure all hot call targets are statically
linked (avoiding PLT indirection) and use both PGO and ThinLTO. Well
tuned servers using all of these techniques saw 5% - 10% overhead from
the use of retpoline.

We will add detailed documentation covering these components in
subsequent patches, but wanted to make the core functionality available
as soon as possible. Happy for more code review, but we'd really like to
get these patches landed and backported ASAP for obvious reasons. We're
planning to backport this to both 6.0 and 5.0 release streams and get
a 5.0 release with just this cherry picked ASAP for distros and vendors.

This patch is the work of a number of people over the past month: Eric, Reid,
Rui, and myself. I'm mailing it out as a single commit due to the
time-sensitive nature of landing this and the need to backport it. Huge thanks to
everyone who helped out here, and everyone at Intel who helped out in
discussions about how to craft this. Also, credit goes to Paul Turner (at
Google, but not an LLVM contributor) for much of the underlying retpoline
design.

chandlerc added inline comments.Jan 4 2018, 4:32 PM
llvm/lib/CodeGen/IndirectBrExpandPass.cpp
113

Yeah, I'm convinced. It actually makes the code *way* simpler. I've implemented all of the suggestions here.

llvm/lib/Target/X86/X86InstrCompiler.td
1160

Awesome, applied!

ab accepted this revision.Jan 4 2018, 5:00 PM
ab added a subscriber: ab.

Clang/LLVM pieces LGTM. Thanks for doing this, people.

llvm/lib/CodeGen/IndirectBrExpandPass.cpp
56

Nit: Unnecessary 'private'?

llvm/lib/Target/X86/X86ISelLowering.cpp
27120

I wonder: should this also check the CSR regmask? There are a couple CCs that do preserve R11, but it probably doesn't matter for -mretpoline users. E.g.,

define void @t(void ()* %f) {
  call cc 17 void %f()
  ret void
}

Worth an assert?

llvm/lib/Target/X86/X86RetpolineThunks.cpp
119

'retq' at the end

171

Nit: newlines between functions

183

'noinline' makes sense, but I'm curious: is it necessary?

llvm/lib/Target/X86/X86Subtarget.h
344

Nit: "Use retpoline to avoid ..."

MaskRay added a subscriber: MaskRay.Jan 4 2018, 5:58 PM
chandlerc updated this revision to Diff 128698.Jan 4 2018, 6:06 PM

Add support for externally provided thunks. This is an independent feature;
when combined with the overall retpoline feature it suppresses the thunk
emission and rotates the names to be distinct names that an external build
system for the kernel (for example) can provide.

I've added some minimal documentation about the semantic requirements of these
thunks to the commit log, although it is fairly obvious. More comprehensive
documentation will be part of the large follow-up effort around docs.

Also adds Rui's work to provide more efficient PLT thunks.

Also addresses remaining feedback from Ahmed.

In D41723#967861, @ruiu wrote:

Chandler,

Please apply https://reviews.llvm.org/D41744 to this patch. It includes the following changes:

  1. xchg is replaced with mov/pop instructions
  2. x86-64 lazy PLT relocation target is now aligned to 16 byte
  3. the x86-64 PLT header for lazy PLT resolution is shrunk from 48 bytes to 32 bytes (which became possible by utilizing the space made by (2))

Done, and awesome!

Reid, do you want me to adjust the code for the 32-bit _push thunk?

chandlerc updated this revision to Diff 128700.Jan 4 2018, 6:12 PM
chandlerc marked 16 inline comments as done.

Add the new test file for external thunks.

chandlerc marked an inline comment as done.Jan 4 2018, 6:13 PM

All comments addressed now I think, thanks everyone! See some thoughts about a few of these below though.

llvm/lib/Target/X86/X86ISelLowering.cpp
27120

We should already have the assert below? If we get through this and have no available reg, we assert that we're in 32-bit.

That said, this really is a case that will come up. We shouldn't silently miscompile the code. I've changed both cases to report_fatal_error because these are fundamental incompatibilities with retpoline.

llvm/lib/Target/X86/X86RetpolineThunks.cpp
183

Nope, not necessary. I've removed it.

Actually, I don't think any of these are necessary because we hand build the MI and we are the last pass before the asm printer. But I've left the others for now. They seem at worst harmless.

I had one comment I think got missed: there are some references to X86::CALL64r/X86::CALL64m in X86FrameLowering.cpp and X86MCInstLower.cpp which look like they could be relevant, but aren't addressed by this patch.

ruiu added a comment.Jan 4 2018, 9:43 PM

Chandler, please apply https://reviews.llvm.org/P8058 to address latest review comments.

eax added a subscriber: eax.Jan 5 2018, 12:30 AM
chandlerc updated this revision to Diff 128715.Jan 5 2018, 1:51 AM
chandlerc marked an inline comment as done.

Rebase and fold in the latest batch of changes from Rui based on the review
from Rafael.

royger added a comment.Jan 5 2018, 1:58 AM

Add support for externally provided thunks. This is an independent feature;
when combined with the overall retpoline feature it suppresses the thunk
emission and rotates the names to be distinct names that an external build
system for the kernel (for example) can provide.

I've added some minimal documentation about the semantic requirements of these
thunks to the commit log, although it is fairly obvious. More comprehensive
documentation will be part of the large follow-up effort around docs.

Thanks! I'm however not seeing the updated commit message that contains the usage documentation of the new option in the differential revision.

AFAICT from the code, the new option is going to be named "mretpoline_external_thunk", could we have a more generic name that could be used by all options, like:

-mindirect-thunk={retpoline,external}

This should also allow clang to implement new techniques as they become available, without having to add new options for each one of them.

Add support for externally provided thunks. This is an independent feature;
when combined with the overall retpoline feature it suppresses the thunk
emission and rotates the names to be distinct names that an external build
system for the kernel (for example) can provide.

I've added some minimal documentation about the semantic requirements of these
thunks to the commit log, although it is fairly obvious. More comprehensive
documentation will be part of the large follow-up effort around docs.

Thanks! I'm however not seeing the updated commit message that contains the usage documentation of the new option in the differential revision.

Yeah, not sure why phab doesn't update that -- it will be updated when it lands.

AFAICT from the code, the new option is going to be named "mretpoline_external_thunk", could we have a more generic name that could be used by all options, like:

-mindirect-thunk={retpoline,external}

This should also allow clang to implement new techniques as they become available, without having to add new options for each one of them.

Adding options is actually cheaper than adding option parsing like this...

I'm hesitant to name the option in the external case something completely generic, because it isn't as generic as it seems IMO. We hide indirect calls and indirect branches, but we *don't* hide returns which are technically still indirects... The reason is that we are assuming that rewriting arbitrary indirects into a return is an effective mitigation. If it weren't, we'd also need to rewrite returns. This is a large component of the discussion in the linked detailed article around RSB filling -- we have to use some *other* mechanisms to defend ret, and then we rely on them being "safe".

As a consequence to all of this, this flag to clang shouldn't be used for *arbitrary* mitigations. It really should be used for mitigations stemming from return-based trampolining of indirect branches.

If we want truly generic thunk support in the future, we should add that under a separate flag name, and it should likely apply to returns as well as other forms of indirects. I know there has been some discussion about using lfence to protect indirect branch and call on AMD processors, but I'm still extremely dubious about this being an advisable mitigation. I have *significantly* more confidence in the retpoline technique being a truly effective mitigation against the attacks, and the performance we observe in the wild really is completely acceptable (for the Linux kernel, hard to even measure outside of contrived micro-benchmarks).

chandlerc updated this revision to Diff 128721.Jan 5 2018, 2:46 AM

Add explicit checks for various lowerings that would need direct retpoline
support that are not yet implemented. These are constructs that don't show up
in typical C, C++, and other static languages today: state points, patch
points, stack probing and morestack.

I had one comment I think got missed: there are some references to X86::CALL64r/X86::CALL64m in X86FrameLowering.cpp and X86MCInstLower.cpp which look like they could be relevant, but aren't addressed by this patch.

Sorry for failing to respond previously...

I've audited all of these. They should have already triggered a hard failure if we made it through ISel though. I've added explicit and more user-friendly fatal errors if we hit these cases.

The only really interesting cases are statepoints and patchpoints. The other cases are only relevant when in the large code model on x86-64 which is pretty much riddled with bugs already. =/

For AMD processors we may be able to handle indirect jumps via a simpler lfence mechanism. Indirect calls may still require retpoline. If this turns out to be the right solution for AMD processors we may need to put some code in to support this.

Thanks! I'm however not seeing the updated commit message that contains the usage documentation of the new option in the differential revision.

Yeah, not sure why phab doesn't update that -- it will be updated when it lands.

Phabricator keeps the summary field of a revision independent of the commit message (beyond the initial upload). You can use arc diff --verbatim to force the summary to be re-copied from the commit message. However, running that will also update the Reviewers and Subscribers fields with their values in the commit message, which would probably drop a lot of subscribers in this case, so I wouldn't recommend it. You can always Edit Revision and update the summary manually if you really want.

The only really interesting cases are statepoints and patchpoints. The other cases are only relevant when in the large code model on x86-64 which is pretty much riddled with bugs already. =/

To be explicit, you can ignore statepoints and patchpoints for the moment. As the only major user of this functionality, we'll follow up with patches if needed. We're still in the process of assessing actual risk in our environment, but at the moment it looks like we likely won't need this functionality. (Obviously, subject to change as we learn more.)

Any more comments? I'd love to land this and start the backporting and merging process. =D

FWIW, I've built and linked the test suite with this in various modes, both 64-bit and 32-bit, and have no functional failures. I've not done any specific performance measurements using the LLVM test suite, but you can see our initial (very rough) performance data in the OP.

For AMD processors we may be able to handle indirect jumps via a simpler lfence mechanism. Indirect calls may still require retpoline. If this turns out to be the right solution for AMD processors we may need to put some code in to support this.

Yeah, if it ends up that we want non-retpoline mitigations for AMD we can and should add them. One hope I have is that this patch is at least generically *sufficient* (when paired with correct RSB filling) even if it is suboptimal in some cases and we end up adding more precise tools later.

(Er sorry, have two updates from Rafael I need to actually include, will be done momentarily...)

chandlerc updated this revision to Diff 128834.Jan 5 2018, 5:45 PM

Teach the thunk emission to put them in comdats and enhance tests to verify
this.

Also add test coverage for nonlazybind calls which on 64-bit architectures
require retpoline there despite no user written indirect call. This already
worked, but Rafael rightly pointed out we should test it.

For AMD processors we may be able to handle indirect jumps via a simpler lfence mechanism. Indirect calls may still require retpoline. If this turns out to be the right solution for AMD processors we may need to put some code in to support this.

Yeah, if it ends up that we want non-retpoline mitigations for AMD we can and should add them. One hope I have is that this patch is at least generically *sufficient* (when paired with correct RSB filling) even if it is suboptimal in some cases and we end up adding more precise tools later.

Just to say that at Sony we're still doing our investigation and might be interested in lfence. But who knows, we might just zap the predictor on syscalls and context switches; for environments that have mostly a few long-running processes with comparatively few syscalls it might be net cheaper than making every indirection more expensive.

For AMD processors we may be able to handle indirect jumps via a simpler lfence mechanism. Indirect calls may still require retpoline. If this turns out to be the right solution for AMD processors we may need to put some code in to support this.

Yeah, if it ends up that we want non-retpoline mitigations for AMD we can and should add them. One hope I have is that this patch is at least generically *sufficient* (when paired with correct RSB filling) even if it is suboptimal in some cases and we end up adding more precise tools later.

Just to say that at Sony we're still doing our investigation and might be interested in lfence. But who knows, we might just zap the predictor on syscalls and context switches; for environments that have mostly a few long-running processes with comparatively few syscalls it might be net cheaper than making every indirection more expensive.

But retpoline doesn't make every indirection more expensive any more or less than zapping the predictor... You only build the code running in the privileged domain with retpoline, not all of the code, and they both accomplish very similar things.

The performance difference we see between something like retpoline and disabling the predictor on context switches is very significant (retpoline is much, much cheaper).

A good way to think about the cost of these things is this. The cost of retpoline we have observed on the kernel:

  1. the cost of executing the system call with "broken" indirect branch prediction (i.e., reliably mispredicted), plus
  2. the cost of a few extra instructions (very, very few cycles)

Both of these are very effectively mitigated by efforts to remove hot indirect branches from the system call code in the kernel. Because of the nature of most kernels, this tends to be pretty easy and desirable for performance anyways.

By comparison, the cost of toggling off the predictor is:

  1. the exact same cost as #1 above, plus
  2. the cost of toggling the MSR on every context switch

This second cost, very notably, cannot be meaningfully mitigated by PGO, or hand-implemented hot-path specializations without an indirect branch. And our measurements on Intel hardware at least show that this cost of toggling is actually the dominant cost by a very large margin.

So, you should absolutely measure the impact of the AMD solutions you have on your AMD hardware as it may be very significantly different. But I wanted to set the expectation correctly based on what limited experience we have so far (sadly only on Intel hardware).

But retpoline doesn't make every indirection more expensive any more or less than zapping the predictor... You only build the code running in the privileged domain with retpoline, not all of the code, and they both accomplish very similar things.

I do understand that it applies only to privileged code.

The performance difference we see between something like retpoline and disabling the predictor on context switches is very significant (retpoline is much, much cheaper).

I expect you are measuring this in a normal timesharing environment, which is not what we have (more below).

A good way to think about the cost of these things is this. The cost of retpoline we have observed on the kernel:

  1. the cost of executing the system call with "broken" indirect branch prediction (i.e., reliably mispredicted), plus
  2. the cost of a few extra instructions (very, very few cycles)

    Both of these are very effectively mitigated by efforts to remove hot indirect branches from the system call code in the kernel. Because of the nature of most kernels, this tends to be pretty easy and desirable for performance anyways.

Right.

By comparison, the cost of toggling off the predictor is:

  1. the exact same cost as #1 above, plus
  2. the cost of toggling the MSR on every context switch

    This second cost, very notably, cannot be meaningfully mitigated by PGO, or hand-implemented hot-path specializations without an indirect branch. And our measurements on Intel hardware at least show that this cost of toggling is actually the dominant cost by a very large margin.

As I said above, I expect you are measuring this on a normal timesharing system. A game console is not a normal timesharing environment. We reserve a core for the system and the game gets the rest of them. My understanding is that the game is one process and it essentially doesn't get process-context-switched (this understanding could be very flawed), although it probably does get thread-context-switched (which should still be cheap as there is no protection-domain transition involved). If there basically aren't any process context switches while running a game (which is when performance matters most), then moving even a small in-process execution cost to a larger context-switch cost is a good thing, not a bad thing. We have hard-real-time constraints to worry about.

So, you should absolutely measure the impact of the AMD solutions you have on your AMD hardware as it may be very significantly different. But I wanted to set the expectation correctly based on what limited experience we have so far (sadly only on Intel hardware).

I appreciate all the work you've done on this, and your sharing of your findings. Unfortunately on our side we are playing catch-up as we were unaware of the problem before the public announcements and the publication of this patch. We're definitely doing research and our own measurements to understand what is our best way forward.

(If it wasn't obvious, I'm not standing in the way of the patch, just noting the AMD thing which clearly can be a follow-up if it seems to be a better choice there.)

MatzeB added a subscriber: MatzeB.Jan 8 2018, 1:20 PM
MatzeB added inline comments.
llvm/lib/Target/X86/X86RetpolineThunks.cpp
83–84

We need this pass for "correctness" and should never skip it I think.

lattera added a subscriber: lattera.Jan 9 2018, 6:23 AM

FYI: I've imported this patch into a feature branch in HardenedBSD's playground repo. All of HardenedBSD (world + kernel) is compiled with it. I've been running it on my laptop for the past couple days without a single issue. I've enabled it locally on my box for a few applications, notably Firefox and ioquake3. Both work without issue. I plan to start an experimental package build tomorrow with it applied to the entire ports tree (around 29,400 packages).

chandlerc updated this revision to Diff 129849.Jan 15 2018, 5:50 AM

Re-implement the indirectbr removal pass based on Rafael's suggestion. It now
creates a single switch block and threads all the indirictbr's in the function
through that block. This should give substantially smaller code especially in
the case of using retpoline where the switch block expands to a search tree
that we probably don't want to duplicate 100s of times.

chandlerc updated this revision to Diff 129850.Jan 15 2018, 5:55 AM
chandlerc marked an inline comment as done.

Couple of other minor fixes.

Ok, I think this is all done. Rafael, I think I've implemented your suggestion as well and it still passes all my tests (including the test-suite) and a bunch of internal code I have.

Any last comments?

llvm/lib/Target/X86/X86RetpolineThunks.cpp
83–84

Agreed and done.

jyknight added a comment.EditedJan 15 2018, 6:45 PM

Per kernel https://marc.info/?l=linux-kernel&m=151580566622935&w=2 and gcc https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01059.html, it seems AMD needs there to be an lfence in the speculation trap (and the pause is not useful for them, but does no harm). There seems to be some speculation (but no confirmation yet?) that pause *is* necessary vs lfence on intel. So in order to work generically, they seem to be suggesting using both instructions:

loop:
  pause
  lfence
  jmp loop

Some more links
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01209.html
and final patch:
https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b

Per kernel https://marc.info/?l=linux-kernel&m=151580566622935&w=2 and gcc https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01059.html, it seems AMD needs there to be an lfence in the speculation trap (and the pause is not useful for them, but does no harm). There seems to be some speculation (but no confirmation yet?) that pause *is* necessary vs lfence on intel. So in order to work generically, they seem to be suggesting using both instructions:

loop:
  pause
  lfence
  jmp loop

Some more links
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01209.html
and final patch:
https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b

Yes, for AMD we require an "lfence" instruction after the "pause" in the "retpoline" loop filler. This solution has already been accepted in GCC and the Linux kernel.
Can you please do the same in LLVM as well?

Per kernel https://marc.info/?l=linux-kernel&m=151580566622935&w=2 and gcc https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01059.html, it seems AMD needs there to be an lfence in the speculation trap (and the pause is not useful for them, but does no harm). There seems to be some speculation (but no confirmation yet?) that pause *is* necessary vs lfence on intel. So in order to work generically, they seem to be suggesting using both instructions:

loop:
  pause
  lfence
  jmp loop

Some more links
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01209.html
and final patch:
https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b

Thanks for digging all of this up, but I have to say that it would be really awesome if folks from AMD would actually comment on this thread and/or patch rather than us relaying things 2nd and 3rd hand....

I'll look at implementing this, but I'm not super thrilled to change so much of the code at this point. The code as-is is secure, and merely power-inefficient on AMD chips. I'd like to fix that, but if it creates problems in testing, I'm inclined to wait for AMD to actually join the discussion.

Per kernel https://marc.info/?l=linux-kernel&m=151580566622935&w=2 and gcc https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01059.html, it seems AMD needs there to be an lfence in the speculation trap (and the pause is not useful for them, but does no harm). There seems to be some speculation (but no confirmation yet?) that pause *is* necessary vs lfence on intel. So in order to work generically, they seem to be suggesting using both instructions:

loop:
  pause
  lfence
  jmp loop

Some more links
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01209.html
and final patch:
https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b

Per kernel https://marc.info/?l=linux-kernel&m=151580566622935&w=2 and gcc https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01059.html, it seems AMD needs there to be an lfence in the speculation trap (and the pause is not useful for them, but does no harm). There seems to be some speculation (but no confirmation yet?) that pause *is* necessary vs lfence on intel. So in order to work generically, they seem to be suggesting using both instructions:

loop:
  pause
  lfence
  jmp loop

Some more links
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01209.html
and final patch:
https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b

Yes, for AMD we require an "lfence" instruction after the "pause" in the "retpoline" loop filler. This solution has already been accepted in GCC and the Linux kernel.
Can you please do the same in LLVM as well?

Ahh, I see my email crossed yours, sorry.

Have you tested adding 'lfence' and this patch on any AMD platforms? Do you have any results? Can you confirm that these patches are actually working?

I'll look at implementing this, but I'm not super thrilled to change so much of the code at this point.

It seems potentially reasonable to me to commit as-is, and do that as a follow-on patch.

I still think the interface should be fixed, so that later on when lfence (or other thunks) is added clang doesn't end up with:

-mretpoline_external_thunk and -mlfence_external_thunk

Which doesn't make sense. IMHO the interface should be something like:

-mindirect-thunk={retpoline,lfence,external,...}

Or similar, ie: a single option that allows the user to select the thunk to use.

gcc is using such interface:

https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01041.html

anujm1 added a subscriber: anujm1.Jan 16 2018, 5:35 PM


Given the lack of a test case for this issue, we tested SPEC2k17 with the GCC 'retpoline' patch ('pause' vs. 'pause+lfence') on AMD Zen.
There is no overhead from adding 'lfence' after 'pause'.

Also, for AMD we need 'lfence' because it is a dispatch-serializing instruction.
Please refer to: https://www.spinics.net/lists/kernel/msg2697621.html

chandlerc updated this revision to Diff 130532.Jan 18 2018, 6:14 PM

Add the lfence instruction to all of the speculation capture loops, both
those generated by LLVM and those generated by LLD for the PLT.

This should ensure that both Intel and AMD processors stop speculating in these loops rather than continuing to consume resources (and power) uselessly.

Ping! Would really like to get this landed, but as it has changed somewhat, looking for a final round of review....

MatzeB accepted this revision.Jan 19 2018, 5:11 PM

LGTM, I'm all for getting this into the LLVM tree, we can fine tune later as necessary.

This is the first widely used machine module pass. Adding a module pass to the pipeline means we will have all the MI in memory at the same time, rather than creating MI for one function at a time and freeing it after emitting each function. So it would be good to keep an eye on the compiler's memory consumption... Maybe we can find a way to refactor the codegen pipeline later so that we can go back to one function at a time and still have a way to emit some new code afterwards...

llvm/include/llvm/CodeGen/TargetPassConfig.h
416–417

Even though this documentation is mostly copied, how about changing it to:

Targets may add passes immediately before machine code is emitted in this callback.
This is called later than addPreEmitPass().
418

I find the name addEmitPass() misleading as it isn't adding the assembly emission passes. The best bad name I can think of right now is addPreEmit2(), in the spirit of the existing addPreSched2() callback...

llvm/lib/Target/X86/X86RetpolineThunks.cpp
84–85

There shouldn't be any way to use Target/X86 without also having a TargetPassConfig in the pipeline. An assert should be enough.

chandlerc marked 3 inline comments as done.Jan 22 2018, 1:29 AM

Thanks Matthias and Eric!

After a discussion on IRC, the general conclusion I came to is: let's try this as-is, with a clean machine module pass. If we get reports of issues, it should be pretty straightforward to hack it up to use machine function passes instead, but it seems quite a bit cleaner as-is, and everyone seems very hopeful that this will not be an issue in practice. (And none of the testing we've done so far indicates any issues with memory consumption.)

Seems like this has pretty good consensus now (Matthias, Rafael, several others). I'm planning to land this first thing tomorrow morning (west coast US time) and then will start working on backports to release branches.

Thanks again everyone, especially those who helped author large parts of this. Will of course keep an eye out for follow-ups or other issues.

llvm/include/llvm/CodeGen/TargetPassConfig.h
418

I suspect the right thing to do is to change the name of addPreEmitPass to something more representative of reality and document that correctly. Then we can use the obvious name of addPreEmitPass.

But I'd prefer to do the name shuffling of *existing* APIs in a follow-up patch. I've used addPreEmitPass2 here even though I really dislike that name, and left a FIXME.
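
For reference, the X86 target then uses this hook to schedule the thunk-insertion pass at the very end of the machine pipeline; in pseudocode (names per this revision, not guaranteed verbatim):

  void X86PassConfig::addPreEmitPass2() {
    // Runs after addPreEmitPass(), once all other machine passes are done,
    // so the thunk machine functions are created just before emission.
    addPass(createX86RetpolineThunksPass());
  }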

This revision was automatically updated to reflect the committed changes.
chandlerc marked an inline comment as done.

@chandlerc is MacOS support omitted intentionally (ie. not implemented yet)? When I try using the -mretpoline flag on High Sierra I get the following:

fatal error: error in backend: MachO doesn't support COMDATs, '__llvm_retpoline_r11' cannot be lowered.

@chandlerc is MacOS support omitted intentionally (ie. not implemented yet)?

Not at all, I just don't have a Mac system to test with...

When I try using the -mretpoline flag on High Sierra I get the following:

fatal error: error in backend: MachO doesn't support COMDATs, '__llvm_retpoline_r11' cannot be lowered.

This is just that I unconditionally used comdats and that doesn't work on MachO. We should use some other lowering strategy to ensure the thunks are merged by the Mac linker. I'm not an expert there, so would defer to others like Ahmed. Happy to review a patch fixing it though.
