This is an archive of the discontinued LLVM Phabricator instance.

Use branch funnels for virtual calls when retpoline mitigation is enabled.
ClosedPublic

Authored by pcc on Jan 23 2018, 5:08 PM.

Details

Summary

The retpoline mitigation for variant 2 of CVE-2017-5715 inhibits the
branch predictor, and as a result it can lead to a measurable loss of
performance. We can reduce the performance impact of retpolined virtual
calls by replacing them with a special construct known as a branch
funnel, which is an instruction sequence that implements virtual calls
to a set of known targets using a binary tree of direct branches. This
allows the processor to speculately execute valid implementations of the
virtual function without allowing for speculative execution of of calls
to arbitrary addresses.

This patch extends the whole-program devirtualization pass to replace
certain virtual calls with calls to branch funnels, which are
represented using a new llvm.icall.jumptable intrinsic. It also extends
the LowerTypeTests pass to recognize the new intrinsic, generate code
for the branch funnels (x86_64 only for now) and lay out virtual tables
as required for each branch funnel.

The implementation supports full LTO as well as ThinLTO, and extends the
ThinLTO summary format used for whole-program devirtualization to
support branch funnels.

For more details see RFC:
http://lists.llvm.org/pipermail/llvm-dev/2018-January/120672.html

Event Timeline

pcc created this revision.Jan 23 2018, 5:08 PM
chandlerc added inline comments.Jan 23 2018, 6:00 PM
llvm/include/llvm/IR/Intrinsics.td
877

A jump table usually refers to an *indirect* jump using a table of addresses...

Maybe "search tree" or "branch funnel" or some other term would be more clear here and elsewhere?

llvm/lib/Transforms/IPO/LowerTypeTests.cpp
1086–1088

Emitting all of this with (I assume) inline assembly seems like a really messy design. It makes all of this entirely x86-specific, but *this* part is completely independent of architecture.

Why not emit LLVM IR? Is there no pattern of IR that actually lowers to reasonable branches? I find that a little bit surprising, but maybe we should just fix the lowering in that case?

pcc added inline comments.Jan 23 2018, 8:26 PM
llvm/include/llvm/IR/Intrinsics.td
877

Yes, I'm a little inconsistent in the terminology in this patch. I'll go through the patch and try to consistently use "branch funnel".

llvm/lib/Transforms/IPO/LowerTypeTests.cpp
1086–1088

The main reason is that we need to make sure that whatever code we generate conforms with the calling convention used for the virtual call. At this point there is no guarantee that we have correct information about the calling convention to be used here because the "source of truth" for the calling convention is the call site itself, and with ThinLTO there may not be a call site in the current module. We cannot even rely on the function prototype with ThinLTO because we may have discarded the prototype by this point. Furthermore there doesn't seem to be a point in preserving the calling convention because no matter what it is, we will always want the same code. That is what led me to the conclusion that what we need here is an IR construct that represents the notion of a branch funnel call to a function with an unknown prototype -- that is what the llvm.icall.jumptable intrinsic is.

I considered putting the implementation of this intrinsic in the backend like a regular intrinsic, but one of the problems with that is that one of the things that we need to be able to do is stitch together multiple global variables into a single global in order to implement the branch funnel as a binary search, and it is too late to do that in the backend. The LowerTypeTests pass was a convenient place to do it (since it already needs to know how to do it in order to implement llvm.type.test) but admittedly it may not be the best place. It occurred to me that it may be possible to implement the stitching together of globals in a pre-isel pass and the rest of llvm.icall.jumptable in the backend, but I think that's something that's best considered for a followup patch since it may have implications for how we implement llvm.type.test as well.

chandlerc added inline comments.Jan 23 2018, 8:40 PM
llvm/lib/Transforms/IPO/LowerTypeTests.cpp
1086–1088

The more I think about this the more I think that implementing this with inline asm is just the wrong design.

I understand the challenges you're hitting here, but I think we should solve them by lowering in the backend, not in the IR. The IR really can't model the kinds of things you're doing (and it shouldn't!).

My suggestion would be to change the llvm.icall.<mumble> intrinsic or add a during-lowering intrinsic that lets you prepare the global variable stuff as necessary in the IR pass, and hand that prepared information cleanly (and abstractly) to the backend where you can effectively lower it to branch instructions. Since you are (notionally) lowering a call, you may even be able to use the intrinsic all the way into the code generator and just lower this manually with an MI pass.

Another aspect that this will improve is that you shouldn't need to embed things into naked functions at all. This should be something that you can just expand wherever it is needed into the code.

pcc added inline comments.Jan 24 2018, 11:50 AM
llvm/lib/Transforms/IPO/LowerTypeTests.cpp
1086–1088

I'm not sure that I understand your proposal. Are you proposing that in the WholeProgramDevirt pass we would transform each virtual call site into an intrinsic call with the list of possible targets, and then in the backend we would lower the intrinsic call into a call to a branch funnel "function" together with its definition?

chandlerc added inline comments.Jan 24 2018, 6:59 PM
llvm/lib/Transforms/IPO/LowerTypeTests.cpp
1086–1088

I'm imagining in the backend, we would lower the intrinsic call into an *inline* branch funnel across the destinations.

(I can also imagine using a pseudo instead of call to an intrinsic, but as you want to capture a call with some specific calling convention / signature / etc, using a call to an intrinsic seems a reasonable representation... if needed, you could even put the potential destinations into an operand bundle so that the formal argument list *is* actually the argument list which should be used for calling the target function.)

pcc added inline comments.Jan 24 2018, 7:54 PM
llvm/lib/Transforms/IPO/LowerTypeTests.cpp
1086–1088

Are you sure you want an inline branch funnel? It will make the code more "branchy", which doesn't seem great for code size.

i.e. the two options are

call branchfunnel

branchfunnel:
cmp ...
jb target1
je target2
jmp target3

and

cmp ...
jb .Ltmp1
je .Ltmp2
call target3
jmp .Ltmp3
.Ltmp1:
call target1
jmp .Ltmp3
.Ltmp2:
call target2
.Ltmp3:

.

Putting that aside, there are problems with the idea of putting the list of targets in every intrinsic. Most importantly, it complicates matters significantly for ThinLTO. It means that each backend job would need to know the list of targets as well as the layout of the combined global that stores the vtables. We don't need that for any other purpose, so we would need to include that information in the summary just to support this. Because we are including the information in the summary, it means that any change to the class hierarchy would cause a ThinLTO cache miss. (Right now, in many cases, we can avoid a cache miss as a result of careful summary design.) In other words, this proposal would be making the code more complicated and less efficient. So it doesn't seem like a good direction to me.

I think we can both agree that lowering to inline asm isn't great. But I see a different way for this code to evolve so that it is properly layered. Essentially we would invent a new top-level entity (like a function or a global variable) that represents a thunk. There is already a need for such a top-level entity to represent CFI jump tables (currently we represent them in the same hackish way using inline asm), so to start with there would be two kinds of thunks: jump tables and branch funnels. Target specific code in the backend would lower the thunks to MI or MCInstrs.

This would have a few advantages which would be shared with your intrinsic proposal:

  • the IR would be somewhat platform independent until the backend,
  • thus it can be easily optimized by midlevel passes,
  • the backend would be able to choose whether to emit the branch funnel inline or outline,

and most importantly for ThinLTO:

  • it would support separate compilation.

Please let me know what you think.

Sorry for the delay, took me a bit to internalize what you were proposing.

llvm/lib/Transforms/IPO/LowerTypeTests.cpp
1086–1088

Having separate inline branch funnels may actually be nice when there are very few targets as they may be better predicted. There is a size/speed trade-off here.

However, the separate compilation issue is very compelling. I think you can use an intrinsic to implement your proposal and get the advantages you describe without any new top-level construct:

You could define / declare a thunk IR function which contains just this intrinsic. Only one TU gets the definition which calls the intrinsic and needs the global information. The others just get a declaration. No new top-level construct, and you lower it exactly as you describe.

I'm not sure if this will directly map to the jump table case, but I'm happy for that to be dealt with in a follow-up as that is already in tree.

If this idea doesn't work, let's chat about what alternatives would work or how to model the top-level entity. I feel pretty strongly that the current code with the current amount of inline assembly is really not up to the quality that should go into the tree even as a temporary thing. There doesn't seem to be such urgency here that we can't take the time to engineer this the right way.

Also:

  1. We didn't actually clean up the inline asm for the CFI jump tables despite those being in-tree for a long time, so I think it is reasonable to insist we don't make things worse.
  2. This is a *much* more significant usage of inline asm, and so it seems much less reasonable as a temporary solution.
pcc added inline comments.Feb 5 2018, 9:08 PM
llvm/lib/Transforms/IPO/LowerTypeTests.cpp
1086–1088

You could define / declare a thunk IR function which contains just this intrinsic. Only one TU gets the definition which calls the intrinsic and needs the global information. The others just get a declaration. No new top-level construct, and you lower it exactly as you describe.

The problem with that is: what would be the signature/calling convention of that function? That is the problem that I described on the first comment on this thread.

If this idea doesn't work, let's chat about what alternatives would work or how to model the top-level entity. I feel pretty strongly that the current code with the current amount of inline assembly is really not up to the quality that should go into the tree even as a temporary thing. There doesn't seem to be such urgency here that we can't take the time to engineer this the right way.

Also:

  1. We didn't actually clean up the inline asm for the CFI jump tables despite those being in-tree for a long time, so I think it is reasonable to insist we don't make things worse.
  2. This is a *much* more significant usage of inline asm, and so it seems much less reasonable as a temporary solution.

Okay, fair enough. Assuming that you agree with my objection above, I will try to sketch out a more detailed plan for how I think the new top-level entity should look in the IR.

echristo added inline comments.Feb 8 2018, 11:28 AM
llvm/lib/Transforms/IPO/LowerTypeTests.cpp
1086–1088

I think you could count it as a naked function yes? I don't see a reason for a top level "thunk" entity in the IR necessarily.

pcc updated this revision to Diff 135173.Feb 20 2018, 4:55 PM
  • First draft of MI-based lowering -- tests to come
pcc added inline comments.Feb 20 2018, 5:12 PM
llvm/lib/Transforms/IPO/LowerTypeTests.cpp
1086–1088

The main issue was that the thunk needs to be calling-convention-independent, which I thought was unrepresentable in IR, but it turns out that we can base it on a representation which is apparently used for thunks on Windows. That representation is to mark the thunk as varargs and mark the call as musttail. That is what this patch implements.

pcc added a comment.Feb 27 2018, 5:27 PM

@chandlerc ping -- let me know whether this MI-based implementation seems reasonable to you.

In D42453#1021513, @pcc wrote:

@chandlerc ping -- let me know whether this MI-based implementation seems reasonable to you.

Thanks for the ping, I missed this while on vacation last week. Will have feedback shortly....

I've not looked at all of the stuff yet, but focusing on the MI and lowering, yeah, this looks exactly like the direction I was imagining. Do you want me to go ahead and do a full review, or do you have other cleanups you need to make first? Were there any problems you ran into with this approach (other than the need to wire everything up here) that still need to be sorted out or that I can help with?

pcc added a comment.Feb 27 2018, 6:26 PM

The code is ready for review, but the tests still need to be updated. Up to you whether you'd like to take a look now or wait for the tests.

chandlerc requested changes to this revision.Mar 2 2018, 2:23 AM

Did a pass over the code. Really only minor nits and naming questions. Generally, the structure and such makes a lot of sense to me now. Thanks so much for working on threading this through all the layers so nicely!

Happy to take another look when you get tests and such updated.

llvm/include/llvm/IR/ModuleSummaryIndex.h
607–608

Is it really a jump table though? You use this term throughout, so only commenting here, but I wonder if there is a better term that makes it more clear what is going on here. I'm worried people will assume this is actually implemented with a jump table as opposed to a branch funnel (or binary-ish search).

llvm/include/llvm/Target/Target.td
1146

Does it make sense to put a comment string here to make reading the dump of MI easier? (I'm not sure, genuine question here.)

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6049

SmallVector? Seems likely to have lots of cases where 8 or 16 would handle..

6064

Same question here.

6070–6071

Is it worth teaching the verifier about this and the other invariants below? Could then make this just an assert. Minor point though, and would be fine as a follow-up.

llvm/lib/Transforms/IPO/LowerTypeTests.cpp
297

nit: I would have capitalized Icall as ICall throughout.

This revision now requires changes to proceed.Mar 2 2018, 2:23 AM
pcc updated this revision to Diff 137512.Mar 7 2018, 5:02 PM
pcc marked 3 inline comments as done.
  • Update tests and fix a couple of benign bugs in the code
  • SmallVector
  • Renaming
pcc added inline comments.Mar 7 2018, 5:02 PM
llvm/include/llvm/IR/ModuleSummaryIndex.h
607–608

Sorry, that was one of the things that I forgot to update in the code. Renamed to "branch funnel".

llvm/include/llvm/Target/Target.td
1146

No, the MI dump already says "ICALL_BRANCH_FUNNEL". This string is used in the asm output for instructions that survive until MC, and this one doesn't.

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6070–6071

This particular check can't be a verifier check because the wholeprogramdevirt pass creates IR that would not pass this check (it is later fixed up by lowertypetests). The others would be fine as verifier checks and we might do that as a followup.

chandlerc accepted this revision.Mar 8 2018, 4:03 PM

This looks really, really nice. Thanks for all the hard work here!

llvm/include/llvm/Target/Target.td
1146

Ah, nice. Thanks for checking!

This revision is now accepted and ready to land.Mar 8 2018, 4:03 PM
This revision was automatically updated to reflect the committed changes.