This is an archive of the discontinued LLVM Phabricator instance.

[XRay][X86] Use a valid instruction for the synthetic reference.
AbandonedPublic

Authored by dberris on Jul 26 2017, 10:03 PM.

Details

Summary

Before this change, we unconditionally write out a synthetic reference to the
XRay function index label in the text section. This makes it so that we're
writing valid instructions in the text section before the actual
relocation/reference.

With this change we're aiming to help the CPU's decoder to decode valid
instructions instead of potentially stalling because of almost quite literally
garbage data in the text section. To do this, we're introducing a protected
virtual function in the AsmPrinter class, which is specifically an extension
point for emitting the XRay synthetic reference.

We're using the INT3 and 7-byte nops for padding before emitting the actual
relocation/reference.

Event Timeline

dberris created this revision.Jul 26 2017, 10:03 PM

Hi Craig, Reid, and Evgenii,

Do you have alternatives to movabsq for what I'm trying to achieve here?

Thanks in advance!

craig.topper edited edge metadata.Jul 26 2017, 10:31 PM

Does xray only work on 64-bit x86?

R10D isn't a valid register for MOV64ri. It should be R10. The D is the 32-bit size.

I can't think of a better instruction if you need all the bits. movabsq is the only instruciton that takes a 64-bit immediate. The only other option might be one of the loads into al/ax/eax/rax from a 64-bit absolute address.

Does xray only work on 64-bit x86?

Yes, we don't have an implementation for 32-bit x86.

If we end up doing so, we will need to update the implementation to check and use the right instructions for 32-bit. Not today though. :)

R10D isn't a valid register for MOV64ri. It should be R10. The D is the 32-bit size.

Whoops -- good catch, thanks!

I can't think of a better instruction if you need all the bits. movabsq is the only instruciton that takes a 64-bit immediate. The only other option might be one of the loads into al/ax/eax/rax from a 64-bit absolute address.

I see. I think writing to the scratch register from outside of the function bodies might be more "benign" if it ever gets executed (compared to a read into one of the other registers). Although I'm not a security expert, so maybe the tradeoff here isn't as clear-cut. :/

Will the x86 pipelines be happier with a store to a register than a load of an address? Or put another way, is the cost of movabsq <immediate>,%r10 higher if it gets speculatively executed compared to say a load instruction? Or should I not even worry about this because the odds of the instruction being decoded/executed is very small (since it's aligned to 16 byte boundaries away from the end of the function)?

dberris updated this revision to Diff 108474.Jul 27 2017, 8:25 AM
  • fixup: use %r10 not %r10d
dberris updated this revision to Diff 108477.Jul 27 2017, 8:29 AM
dberris edited the summary of this revision. (Show Details)

(reword)

dberris updated this revision to Diff 109047.Jul 31 2017, 9:20 PM
  • Add some nops after the synthesized instruction
chandlerc edited edge metadata.Jul 31 2017, 10:26 PM

I'd like to revisit the goal here:

Is this a performance fix by avoiding an unknown instruction that gets decoded in some cases? Or is there a correctness issue? (I think you mentioned something about linking having trouble w/ this?)

I can't think of a better instruction if you need all the bits. movabsq is the only instruciton that takes a 64-bit immediate. The only other option might be one of the loads into al/ax/eax/rax from a 64-bit absolute address.

I see. I think writing to the scratch register from outside of the function bodies might be more "benign" if it ever gets executed (compared to a read into one of the other registers). Although I'm not a security expert, so maybe the tradeoff here isn't as clear-cut. :/

Will the x86 pipelines be happier with a store to a register than a load of an address? Or put another way, is the cost of movabsq <immediate>,%r10 higher if it gets speculatively executed compared to say a load instruction? Or should I not even worry about this because the odds of the instruction being decoded/executed is very small (since it's aligned to 16 byte boundaries away from the end of the function)?

Emitting a load sounds really bad here. Honestly, even movabsq seems a bit bad if you're worried about performance.

To help with performance I would pad this out with nops that we know decode *fast* (as in, not into micro-ops). If you want to be defensive about ROP gadgets, I'd emit a seld of int3 instructions followed by whatever you want.

The interesting bit (as indicated above) is whether you can just guard the raw data with a nop/int3 sled, or if you actually need a valid instruction. If you *do* need a valid instruction, I'd like to understand why. We have a reasonable number of places that slap raw data above a function section...

dberris updated this revision to Diff 109060.Aug 1 2017, 1:29 AM

Use different instructions for padding.

dberris updated this revision to Diff 109061.Aug 1 2017, 1:33 AM
dberris edited the summary of this revision. (Show Details)

(reword)

I'd like to revisit the goal here:

Is this a performance fix by avoiding an unknown instruction that gets decoded in some cases? Or is there a correctness issue? (I think you mentioned something about linking having trouble w/ this?)

This is mainly for performance reasons, one that's speculative too. It's an attempt to align whatever is coming next after the synthetic reference to 16-bytes, so I was looking for a way to do this with "valid" instructions.

I can't think of a better instruction if you need all the bits. movabsq is the only instruciton that takes a 64-bit immediate. The only other option might be one of the loads into al/ax/eax/rax from a 64-bit absolute address.

I see. I think writing to the scratch register from outside of the function bodies might be more "benign" if it ever gets executed (compared to a read into one of the other registers). Although I'm not a security expert, so maybe the tradeoff here isn't as clear-cut. :/

Will the x86 pipelines be happier with a store to a register than a load of an address? Or put another way, is the cost of movabsq <immediate>,%r10 higher if it gets speculatively executed compared to say a load instruction? Or should I not even worry about this because the odds of the instruction being decoded/executed is very small (since it's aligned to 16 byte boundaries away from the end of the function)?

Emitting a load sounds really bad here. Honestly, even movabsq seems a bit bad if you're worried about performance.

To help with performance I would pad this out with nops that we know decode *fast* (as in, not into micro-ops). If you want to be defensive about ROP gadgets, I'd emit a seld of int3 instructions followed by whatever you want.

The interesting bit (as indicated above) is whether you can just guard the raw data with a nop/int3 sled, or if you actually need a valid instruction. If you *do* need a valid instruction, I'd like to understand why. We have a reasonable number of places that slap raw data above a function section...

I settled on an INT3 then a 7-byte sled, then the reference. This is so that we can mitigate as much as possible the off-chance that the synthetic reference would be decoded and speculatively executed.

Thanks, Chandler!

PTAL

dberris abandoned this revision.Aug 29 2017, 11:47 PM

This is no longer necessary, now that we've found a way to keep the sections alive without the use of a synthetic reference.