Page MenuHomePhabricator

Initial AArch64 SLH implementation.

Authored by kristof.beyls on Dec 20 2018, 6:47 AM.



This is an initial implementation for Speculative Load Hardening for
AArch64. It builds on top of the recently introduced
AArch64SpeculationHardening pass.
This doesn't implement (yet) some of the optimizations implemented for
the X86SpeculativeLoadHardening pass. I thought introducing the
optimizations incrementally in follow-up patches should make this easier
to review.

Diff Detail


Event Timeline

kristof.beyls created this revision.Dec 20 2018, 6:47 AM

Fix 2 issues flagged by -verify-machineinstrs when testing this patch on more code.

olista01 added inline comments.Jan 8 2019, 7:44 AM
604 ↗(On Diff #180468)

Why do you need to insert pseudo-instructions here, only to replace them with ANDs later on? Could this loop be moved to after the control-flow tracking has been inserted, and create the ANDs directly?

kristof.beyls added inline comments.Jan 8 2019, 8:11 AM
604 ↗(On Diff #180468)

This design inserting pseudo-instructions is based on an earlier design I did (which did not get committed) to implement the intrinsics based approach that is also implemented in gcc.
By keeping this design, it will be easier in the future to also support the intrinsics based approach (as documented from a user point-of-view at
The idea being that the user-specified intrinsics will be lowered to the pseudo-instruction, and that SLH basically inserts the pseudo-instructions. These pseudo-instructions then get lowered in the same way no matter if they came from a user-written intrinsic or from an automatically inserted pseudo-instruction by SLH.
Granted, maybe the only non-trivial part of lowering the pseudo-instruction is the algorithm to optimize/reduce the number of CSDB instructions that are inserted.

Obviously, we could not use the pseudo-instructions for now and only introduce them if we introduce the intrinsics-based approach too.
However, I'm not sure in how far that will complicate the optimization reducing the number of CSDBs inserted.
Let me look into how easy or complicated the alternative design without pseudo-instructions would be.

kristof.beyls marked 3 inline comments as done.Jan 9 2019, 1:41 AM
kristof.beyls added inline comments.
604 ↗(On Diff #180468)

I've know looked into a design where the pass iterates over each basic block just once, inserting masking operations and csdb instructions in a single go.
It turns out that the logic to implement the optimized csdb insertion becomes much more fiddly to implement correctly (I'm not sure I've managed to implement it correctly when trying, even after a few iterations of fixing failing regression tests). An extra complexity with the insertion of csdb instructions is that there isn't an easy way to test that it has been done correctly. When csdb instructions aren't inserted in the places it should, the programs will still execute and produce results as expected.

For that reason, I'd prefer to keep the design where we iterate each basic block twice: once for deciding which registers must be masked at which program locations and once to insert CSDBs. That way, we can implement the CSDB instruction insertion separately from the other transformations in this pass and it is less likely there will be bugs in that implementation.

As to the use of pseudo instructions: we could not use pseudo instructions and encode the same information as the pseudo instructions in side data structures in the pass to convey information between the multiple iterations across a basic block. But that again might result in a more complex implementation. And furthermore, when we implement support for user-facing intrinsics, we'll still need those pseudos anyway.

Therefore, I think the current design overall is a better trade-off.

olista01 accepted this revision.Jan 9 2019, 3:24 AM

LGTM, thanks.

604 ↗(On Diff #180468)

My concern was that the two-phase implementation was more complex than doing it on one phase, but if it simplifies the CSDB generation then it makes sense.

I agree that using pseudo-instructions is better than storing the same information in a side data structure, and as you say we'll likely need them anyway to implement user-facing intrinsics.

This revision is now accepted and ready to land.Jan 9 2019, 3:24 AM
Closed by commit rL350729: Initial AArch64 SLH implementation. (authored by kbeyls, committed by ). · Explain WhyJan 9 2019, 7:17 AM
This revision was automatically updated to reflect the committed changes.
zbrid added a comment.Jan 15 2019, 3:43 PM

Hi Kristof,

Thanks for writing this code! I had an easy time understanding what you were doing.

I was discussing this SLH implementation with Chandler and he suggested that it may be useful to write a design doc similar to the one for the x86 implementation, so other people in the community can understand and discuss the current/future design with the ARM specific details laid out. What do you think?


Right now, this masks after every load in a program. Is one of the future optimizations you mention in the comment in this file to mask only after loads that are depended upon by later non-data invariant operations?


I'm not sure if this case is possible, but is it possible that some defs are GPR and some aren't? If that's possible would it be worthwhile to harden the GPRs and non-GPRs in different ways? It appears to me that currently if any are non-GPR, then all of them are treated as non-GPRs.

kristof.beyls marked 3 inline comments as done.Jan 16 2019, 5:15 AM

Hi Kristof,

Thanks for writing this code! I had an easy time understanding what you were doing.

I was discussing this SLH implementation with Chandler and he suggested that it may be useful to write a design doc similar to the one for the x86 implementation, so other people in the community can understand and discuss the current/future design with the ARM specific details laid out. What do you think?

I agree it would be good to have such a design doc. Actually I think it would be best to adapt and separate out x86 and aarch64-specific implementation aspects from the target independent concepts there.
I am happy to try and do that, but will definitely not have time to do so this week. Maybe later this month.
I'd also be happy if e.g. you yourself would like to work on that.


Rereading the FIXMEs I've written in this function: no, that is not recorded as a potential future optimization.
Out of interest, is this optimization described in the design document at somewhere?
I haven't thought about that optimization in detail yet and it seems a bit unclear to me why it is always safe to perform such an optimization.


Yeah, something like
LDR D0, [X0], #8
would load a value into D0 and update the address in X0, so def-ing both D0 and X0.
However, with current code, X0 would be masked ("the address loaded from"/HardenAddressLoadedFrom).
If you already harden the address loaded from, there is no need to also harden the loaded data.
If you already have to harden the address loaded from, my guess is that there is not much opportunity for further optimizing/reducing the overhead of that masking.
But I am happy to be proven wrong - I haven't thought this through in detail.

zbrid added a comment.Jan 16 2019, 7:54 AM

Sounds good. I have a bit of time and can start working on updating the design doc to make it more target independent outside of the implementation section.


I was trying to describe this section and the following section which is a bit different than what I said. Sorry about that.

I'm not sure if the assumptions for those implementation detail sections hold for ARM.