This is an archive of the discontinued LLVM Phabricator instance.

Differential D55929

Initial AArch64 SLH implementation.
ClosedPublic

Authored by kristof.beyls on Dec 20 2018, 6:47 AM.

Details

Reviewers

olista01

Commits

rGc650ff77eb10: Initial AArch64 SLH implementation.
rL350729: Initial AArch64 SLH implementation.

Summary

This is an initial implementation for Speculative Load Hardening for
AArch64. It builds on top of the recently introduced
AArch64SpeculationHardening pass.
This doesn't implement (yet) some of the optimizations implemented for
the X86SpeculativeLoadHardening pass. I thought introducing the
optimizations incrementally in follow-up patches should make this easier
to review.

Diff Detail

Event Timeline

kristof.beyls created this revision. Dec 20 2018, 6:47 AM

Herald added a subscriber: javed.absar. Dec 20 2018, 6:47 AM

Fix 2 issues flagged by -verify-machineinstrs when testing this patch on more code.

olista01 added inline comments. Jan 8 2019, 7:44 AM

lib/Target/AArch64/AArch64SpeculationHardening.cpp
583–590	Why do you need to insert pseudo-instructions here, only to replace them with ANDs later on? Could this loop be moved to after the control-flow tracking has been inserted, and create the ANDs directly?

kristof.beyls added inline comments. Jan 8 2019, 8:11 AM

lib/Target/AArch64/AArch64SpeculationHardening.cpp
583–590	This design of inserting pseudo-instructions is based on an earlier design of mine (which did not get committed) to implement the intrinsics-based approach that is also implemented in gcc. By keeping this design, it will be easier in the future to also support the intrinsics-based approach (as documented from a user point of view at https://lwn.net/Articles/759423/). The idea is that the user-specified intrinsics will be lowered to the pseudo-instruction, and that SLH basically inserts the same pseudo-instructions. These pseudo-instructions then get lowered in the same way, no matter whether they came from a user-written intrinsic or were inserted automatically by SLH. Granted, maybe the only non-trivial part of lowering the pseudo-instruction is the algorithm to reduce the number of CSDB instructions that are inserted. Obviously, we could avoid the pseudo-instructions for now and only introduce them if we introduce the intrinsics-based approach too. However, I'm not sure how much that would complicate the optimization that reduces the number of CSDBs inserted. Let me look into how easy or complicated the alternative design without pseudo-instructions would be.
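
To make the lowering discussed here concrete, the AND that each pseudo-instruction is eventually replaced with can be modelled in a few lines. This is only a sketch under an assumption that is not spelled out in this thread: the taint register is taken to hold all-ones on a correctly speculated path and zero once miss-speculation has been detected (the "value 0" encoding is mentioned in the pass's header comment). The names `maskWithTaint`, `kNotMisspeculating` and `kMisspeculating` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed convention for this sketch: all-ones means "not miss-speculating",
// zero means "miss-speculating". These constants are illustrative only.
constexpr std::uint64_t kNotMisspeculating = ~std::uint64_t{0};
constexpr std::uint64_t kMisspeculating = 0;

// Models the effect of `AND dst, src, taint`: the value survives on the
// architecturally correct path and is forced to zero under miss-speculation.
std::uint64_t maskWithTaint(std::uint64_t value, std::uint64_t taint) {
  return value & taint;
}
```

Under that assumption, a loaded value passes through unchanged when the taint is all-ones and becomes zero otherwise, so a dependent access cannot be steered by the speculatively loaded data.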

kristof.beyls marked 3 inline comments as done. Jan 9 2019, 1:41 AM

kristof.beyls added inline comments.

lib/Target/AArch64/AArch64SpeculationHardening.cpp
583–590	I've now looked into a design where the pass iterates over each basic block just once, inserting masking operations and CSDB instructions in a single go. It turns out that the logic for the optimized CSDB insertion becomes much more fiddly to implement correctly (I'm not sure I managed to implement it correctly when trying, even after a few iterations of fixing failing regression tests). An extra complexity with the insertion of CSDB instructions is that there isn't an easy way to test that it has been done correctly: when CSDB instructions aren't inserted where they should be, programs will still execute and produce the expected results. For that reason, I'd prefer to keep the design where we iterate over each basic block twice: once to decide which registers must be masked at which program locations, and once to insert CSDBs. That way, we can implement the CSDB insertion separately from the other transformations in this pass, and it is less likely that there will be bugs in that implementation. As to the use of pseudo-instructions: we could avoid them and instead encode the same information in side data structures in the pass, to convey information between the multiple iterations across a basic block. But that again might result in a more complex implementation. Furthermore, when we implement support for user-facing intrinsics, we'll still need those pseudos anyway. Therefore, I think the current design overall is a better trade-off.
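
The two-iteration structure argued for above can be sketched outside LLVM as a toy model. This is an illustration under stated assumptions, not the patch's code: the `Inst` struct, the `Op` enum, the integer register ids and the function names are all hypothetical, and only a return is modelled as a terminator. What it tries to mirror is the strategy from the discussion: a first walk inserts pseudos after loads, and a second walk lowers them to ANDs while emitting a CSDB as late as possible, just before the first use of a masked register, before a terminator, or at the end of the block.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical toy IR for illustration; these names are not LLVM APIs.
enum class Op { Load, SafeValue, And, Csdb, Add, Ret };

struct Inst {
  Op op;
  int def;               // register written, or -1 if none
  std::vector<int> uses; // registers read
};

// Pass 1: decide what to mask by placing a pseudo after every load.
std::vector<Inst> insertPseudos(const std::vector<Inst> &block) {
  std::vector<Inst> out;
  for (const Inst &inst : block) {
    out.push_back(inst);
    if (inst.op == Op::Load)
      out.push_back(Inst{Op::SafeValue, inst.def, {inst.def}});
  }
  return out;
}

// Pass 2: lower pseudos to ANDs and emit each CSDB as late as possible.
std::vector<Inst> lowerPseudos(const std::vector<Inst> &block, int taintReg) {
  std::vector<Inst> out;
  std::set<int> needCsdb; // masked registers not yet covered by a CSDB
  for (const Inst &inst : block) {
    bool barrier = !needCsdb.empty() && inst.op == Op::Ret;
    for (int reg : inst.uses)
      barrier = barrier || needCsdb.count(reg) != 0;
    if (barrier) {
      out.push_back(Inst{Op::Csdb, -1, {}});
      needCsdb.clear();
    }
    if (inst.op == Op::SafeValue) {
      out.push_back(Inst{Op::And, inst.def, {inst.uses[0], taintReg}});
      needCsdb.insert(inst.def);
    } else {
      out.push_back(inst);
    }
  }
  if (!needCsdb.empty())
    out.push_back(Inst{Op::Csdb, -1, {}});
  return out;
}

// Two loads feeding one add: both masks end up sharing a single CSDB.
std::vector<Inst> demoProgram() {
  std::vector<Inst> block = {{Op::Load, 1, {0}},
                             {Op::Load, 2, {0}},
                             {Op::Add, 3, {1, 2}},
                             {Op::Ret, -1, {3}}};
  return lowerPseudos(insertPseudos(block), /*taintReg=*/16);
}
```

In this toy block the two loads are each followed by an AND, and a single CSDB lands directly before the add that first consumes the masked registers. Keeping the second walk separate means the barrier placement can be reasoned about, and tested, independently of the decision about what to mask.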

LGTM, thanks.

lib/Target/AArch64/AArch64SpeculationHardening.cpp
583–590	My concern was that the two-phase implementation was more complex than doing it in one phase, but if it simplifies the CSDB generation then it makes sense. I agree that using pseudo-instructions is better than storing the same information in a side data structure, and as you say we'll likely need them anyway to implement user-facing intrinsics.

This revision is now accepted and ready to land. Jan 9 2019, 3:24 AM

Closed by commit rL350729: Initial AArch64 SLH implementation. (authored by kbeyls). Jan 9 2019, 7:17 AM

This revision was automatically updated to reflect the committed changes.

Hi Kristof,

Thanks for writing this code! I had an easy time understanding what you were doing.

I was discussing this SLH implementation with Chandler and he suggested that it may be useful to write a design doc similar to the one for the x86 implementation, so other people in the community can understand and discuss the current/future design with the ARM specific details laid out. What do you think?

llvm/trunk/lib/Target/AArch64/AArch64SpeculationHardening.cpp
411 ↗	(On Diff #180840)	Right now, this masks after every load in a program. Is one of the future optimizations you mention in the comment in this file to mask only after loads that are depended upon by later non-data invariant operations?
447 ↗	(On Diff #180840)	I'm not sure if this case is possible, but is it possible that some defs are GPR and some aren't? If that's possible would it be worthwhile to harden the GPRs and non-GPRs in different ways? It appears to me that currently if any are non-GPR, then all of them are treated as non-GPRs.

In D55929#1358920, @zbrid wrote:

Hi Kristof,

Thanks for writing this code! I had an easy time understanding what you were doing.

I was discussing this SLH implementation with Chandler and he suggested that it may be useful to write a design doc similar to the one for the x86 implementation, so other people in the community can understand and discuss the current/future design with the ARM specific details laid out. What do you think?

I agree it would be good to have such a design doc. Actually I think it would be best to adapt https://llvm.org/docs/SpeculativeLoadHardening.html and separate out x86 and aarch64-specific implementation aspects from the target independent concepts there.
I am happy to try and do that, but will definitely not have time to do so this week. Maybe later this month.
I'd also be happy if e.g. you yourself would like to work on that.

llvm/trunk/lib/Target/AArch64/AArch64SpeculationHardening.cpp
411 ↗	(On Diff #180840)	Rereading the FIXMEs I've written in this function: no, that is not recorded as a potential future optimization. Out of interest, is this optimization described in the design document at https://llvm.org/docs/SpeculativeLoadHardening.html somewhere? I haven't thought about that optimization in detail yet and it seems a bit unclear to me why it is always safe to perform such an optimization.
447 ↗	(On Diff #180840)	Yeah, something like LDR D0, [X0], #8 would load a value into D0 and update the address in X0, so def-ing both D0 and X0. However, with current code, X0 would be masked ("the address loaded from"/HardenAddressLoadedFrom). If you already harden the address loaded from, there is no need to also harden the loaded data. If you already have to harden the address loaded from, my guess is that there is not much opportunity for further optimizing/reducing the overhead of that masking. But I am happy to be proven wrong - I haven't thought this through in detail.
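
The choice described in this reply can be sketched as a small standalone function. It is a simplified model, assuming only two register classes; `RegClass`, `Strategy` and `chooseStrategy` are hypothetical names, and the real pass checks register class membership on machine operands instead.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified register classes for illustration only.
enum class RegClass { GPR, FPR };
enum class Strategy { MaskLoadedData, MaskAddress };

// Mirrors the AllDefsAreGPR idea: the loaded data is masked only when every
// register defined by the load is a GPR; otherwise the address is masked.
Strategy chooseStrategy(const std::vector<RegClass> &defs) {
  for (RegClass rc : defs)
    if (rc != RegClass::GPR)
      return Strategy::MaskAddress;
  return Strategy::MaskLoadedData;
}
```

For `LDR D0, [X0], #8` the defs are D0 (non-GPR) and X0 (GPR), so this model picks address masking for the whole instruction, which matches the behaviour described above where X0 is hardened before the load.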

Sounds good. I have a bit of time and can start working on updating the design doc to make it more target independent outside of the implementation section.

llvm/trunk/lib/Target/AArch64/AArch64SpeculationHardening.cpp
411 ↗	(On Diff #180840)	I was trying to describe this section and the one following it, which are a bit different from what I said. Sorry about that: https://llvm.org/docs/SpeculativeLoadHardening.html#loads-folded-into-data-invariant-operations-can-be-hardened-after-the-operation. I'm not sure whether the assumptions in those implementation-detail sections hold for ARM.

Revision Contents

Path	Size
lib/Target/AArch64/AArch64InstrInfo.cpp	5 lines
lib/Target/AArch64/AArch64InstrInfo.td	9 lines
lib/Target/AArch64/AArch64SpeculationHardening.cpp	281 lines
test/CodeGen/AArch64/speculation-hardening-loads.ll	100 lines

Diff 179066

lib/Target/AArch64/AArch64InstrInfo.cpp

[959 unchanged lines not shown]
 }

 bool AArch64InstrInfo::isSchedulingBoundary(const MachineInstr &MI,
                                             const MachineBasicBlock *MBB,
                                             const MachineFunction &MF) const {
   if (TargetInstrInfo::isSchedulingBoundary(MI, MBB, MF))
     return true;
   switch (MI.getOpcode()) {
+  case AArch64::HINT:
+    // CSDB hints are scheduling barriers.
+    if (MI.getOperand(0).getImm() == 0x14)
+      return true;
+    break;
   case AArch64::DSB:
   case AArch64::ISB:
     // DSB and ISB also are scheduling barriers.
     return true;
   default:;
   }
   return isSEHInstruction(MI);
 }
[4,560 unchanged lines not shown]

lib/Target/AArch64/AArch64InstrInfo.td

[366 unchanged lines not shown]
 def AArch64sitof: SDNode<"AArch64ISD::SITOF", SDT_AArch64ITOF>;
 def AArch64uitof: SDNode<"AArch64ISD::UITOF", SDT_AArch64ITOF>;

 def AArch64tlsdesc_callseq : SDNode<"AArch64ISD::TLSDESC_CALLSEQ",
                                     SDT_AArch64TLSDescCallSeq,
                                     [SDNPInGlue, SDNPOutGlue, SDNPHasChain,
                                      SDNPVariadic]>;


 def AArch64WrapperLarge : SDNode<"AArch64ISD::WrapperLarge",
                                  SDT_AArch64WrapperLarge>;

 def AArch64NvCast : SDNode<"AArch64ISD::NVCAST", SDTUnaryOp>;

 def SDT_AArch64mull : SDTypeProfile<1, 2, [SDTCisInt<0>, SDTCisInt<1>,
                                            SDTCisSameAs<1, 2>]>;
 def AArch64smull : SDNode<"AArch64ISD::SMULL", SDT_AArch64mull>;
[131 unchanged lines not shown]
 // algorithms. Immediate operand is the number of bytes this "instruction"
 // occupies; register operands can be used to enforce dependency and constrain
 // the scheduler.
 let hasSideEffects = 1, mayLoad = 1, mayStore = 1 in
 def SPACE : Pseudo<(outs GPR64:$Rd), (ins i32imm:$size, GPR64:$Rn),
                    [(set GPR64:$Rd, (int_aarch64_space imm:$size, GPR64:$Rn))]>,
             Sched<[]>;

+let hasSideEffects = 1, isCodeGenOnly = 1 in {
+  def SpeculationSafeValueX
+      : Pseudo<(outs GPR64:$dst), (ins GPR64:$src), []>, Sched<[]>;
+  def SpeculationSafeValueW
+      : Pseudo<(outs GPR32:$dst), (ins GPR32:$src), []>, Sched<[]>;
+}


 //===----------------------------------------------------------------------===//
 // System instructions.
 //===----------------------------------------------------------------------===//

 def HINT : HintI<"hint">;
 def : InstAlias<"nop", (HINT 0b000)>;
 def : InstAlias<"yield",(HINT 0b001)>;
 def : InstAlias<"wfe", (HINT 0b010)>;
[6,263 unchanged lines not shown]

lib/Target/AArch64/AArch64SpeculationHardening.cpp

[10 unchanged lines not shown]
 // vulnerabilities that may happen under control flow miss-speculation.
 //
 // The pass implements tracking of control flow miss-speculation into a "taint"
 // register. That taint register can then be used to mask off registers with
 // sensitive data when executing under miss-speculation, a.k.a. "transient
 // execution".
 // This pass is aimed at mitigating against SpectreV1-style vulnarabilities.
 //
-// At the moment, it implements the tracking of miss-speculation of control
-// flow into a taint register, but doesn't implement a mechanism yet to then
-// use that taint register to mask of vulnerable data in registers (something
-// for a follow-on improvement). Possible strategies to mask out vulnerable
-// data that can be implemented on top of this are:
-// - speculative load hardening to automatically mask of data loaded
-//   in registers.
-// - using intrinsics to mask of data in registers as indicated by the
-//   programmer (see https://lwn.net/Articles/759423/).
+// It also implements speculative load hardening, i.e. using the taint register
+// to automatically mask off loaded data.
 //
-// For AArch64, the following implementation choices are made below.
+// As a possible follow-on improvement, also an intrinsics-based approach as
+// explained at https://lwn.net/Articles/759423/ could be implemented on top of
+// the current design.
+//
+// For AArch64, the following implementation choices are made to implement the
+// tracking of control flow miss-speculation into a taint register:
 // Some of these are different than the implementation choices made in
 // the similar pass implemented in X86SpeculativeLoadHardening.cpp, as
 // the instruction set characteristics result in different trade-offs.
 // - The speculation hardening is done after register allocation. With a
 //   relative abundance of registers, one register is reserved (X16) to be
 //   the taint register. X16 is expected to not clash with other register
 //   reservation mechanisms with very high probability because:
 //   . The AArch64 ABI doesn't guarantee X16 to be retained across any call.
[22 unchanged lines not shown]
 //   This means that the conditional branches must not be implemented as one
 //   of the AArch64 conditional branches that do not use the flags as input
 //   (CB(N)Z and TB(N)Z). This is implemented by ensuring in the instruction
 //   selectors to not produce these instructions when speculation hardening
 //   is enabled. This pass will assert if it does encounter such an instruction.
 // - On function call boundaries, the miss-speculation state is transferred from
 //   the taint register X16 to be encoded in the SP register as value 0.
 //
+// For the aspect of automatically hardening loads, using the taint register,
+// (a.k.a. speculative load hardening, see
+// https://llvm.org/docs/SpeculativeLoadHardening.html), the following
+// implementation choices are made for AArch64:
+// - Many of the optimizations described at
+//   https://llvm.org/docs/SpeculativeLoadHardening.html to harden fewer
+//   loads haven't been implemented yet - but for some of them there are
+//   FIXMEs in the code.
+// - loads that load into general purpose (X or W) registers get hardened by
+//   masking the loaded data. For loads that load into other registers, the
+//   address loaded from gets hardened. It is expected that hardening the
+//   loaded data may be more efficient; but masking data in registers other
+//   than X or W is not easy and may result in being slower than just
+//   hardening the X address register loaded from.
+// - On AArch64, CSDB instructions are inserted between the masking of the
+//   register and its first use, to ensure there's no non-control-flow
+//   speculation that might undermine the hardening mechanism.
+//
 // Future extensions/improvements could be:
 // - Implement this functionality using full speculation barriers, akin to the
 //   x86-slh-lfence option. This may be more useful for the intrinsics-based
 //   approach than for the SLH approach to masking.
 //   Note that this pass already inserts the full speculation barriers if the
 //   function for some niche reason makes use of X16/W16.
 // - no indirect branch misprediction gets protected/instrumented; but this
 //   could be done for some indirect branches, such as switch jump tables.
[18 unchanged lines not shown]
 #include <cassert>

 using namespace llvm;

 #define DEBUG_TYPE "aarch64-speculation-hardening"

 #define AARCH64_SPECULATION_HARDENING_NAME "AArch64 speculation hardening pass"

+cl::opt<bool> HardenLoads("aarch64-slh-loads", cl::Hidden,
+                          cl::desc("Sanitize loads from memory."),
+                          cl::init(true));
+
 namespace {

 class AArch64SpeculationHardening : public MachineFunctionPass {
 public:
   const TargetInstrInfo *TII;
   const TargetRegisterInfo *TRI;

   static char ID;

   AArch64SpeculationHardening() : MachineFunctionPass(ID) {
     initializeAArch64SpeculationHardeningPass(*PassRegistry::getPassRegistry());
   }

   bool runOnMachineFunction(MachineFunction &Fn) override;

   StringRef getPassName() const override {
     return AARCH64_SPECULATION_HARDENING_NAME;
   }

 private:
   unsigned MisspeculatingTaintReg;
+  unsigned MisspeculatingTaintReg32Bit;
   bool UseControlFlowSpeculationBarrier;
+  BitVector RegsNeedingCSDBBeforeUse;
+  BitVector RegsAlreadyMasked;

   bool functionUsesHardeningRegister(MachineFunction &MF) const;
   bool instrumentControlFlow(MachineBasicBlock &MBB);
   bool endsWithCondControlFlow(MachineBasicBlock &MBB, MachineBasicBlock *&TBB,
                                MachineBasicBlock *&FBB,
                                AArch64CC::CondCode &CondCode) const;
   void insertTrackingCode(MachineBasicBlock &SplitEdgeBB,
                           AArch64CC::CondCode &CondCode, DebugLoc DL) const;
   void insertSPToRegTaintPropagation(MachineBasicBlock *MBB,
                                      MachineBasicBlock::iterator MBBI) const;
   void insertRegToSPTaintPropagation(MachineBasicBlock *MBB,
                                      MachineBasicBlock::iterator MBBI,
                                      unsigned TmpReg) const;
+
+  bool slhLoads(MachineBasicBlock &MBB);
+  bool makeGPRSpeculationSafe(MachineBasicBlock &MBB,
+                              MachineBasicBlock::iterator MBBI,
+                              MachineInstr &MI, unsigned Reg);
+  bool lowerSpeculationSafeValuePseudos(MachineBasicBlock &MBB);
+  bool expandSpeculationSafeValue(MachineBasicBlock &MBB,
+                                  MachineBasicBlock::iterator MBBI);
+  bool insertCSDB(MachineBasicBlock &MBB, MachineBasicBlock::iterator MBBI,
+                  DebugLoc DL);
 };

 } // end anonymous namespace

 char AArch64SpeculationHardening::ID = 0;

 INITIALIZE_PASS(AArch64SpeculationHardening, "aarch64-speculation-hardening",
                 AARCH64_SPECULATION_HARDENING_NAME, false, false)
[180 unchanged lines not shown]
     for (MachineInstr &MI : MBB) {
       if (MI.readsRegister(MisspeculatingTaintReg, TRI) ||
           MI.modifiesRegister(MisspeculatingTaintReg, TRI))
         return true;
     }
   }
   return false;
 }

+// Make GPR register Reg speculation-safe by putting it through the
+// SpeculationSafeValue pseudo instruction, if we can't prove that
+// the value in the register has already been hardened.
+bool AArch64SpeculationHardening::makeGPRSpeculationSafe(
+    MachineBasicBlock &MBB, MachineBasicBlock::iterator MBBI, MachineInstr &MI,
+    unsigned Reg) {
+  assert(AArch64::GPR32allRegClass.contains(Reg) ||
+         AArch64::GPR64allRegClass.contains(Reg));
+
+  // Loads cannot directly load a value into the SP (nor WSP).
+  // Therefore, if Reg is SP or WSP, it is because the instruction loads from
+  // the stack through the stack pointer.
+  //
+  // Since the stack pointer is never dynamically controllable, don't harden it.
+  if (Reg == AArch64::SP || Reg == AArch64::WSP)
+    return false;
+
+  // Do not harden the register again if already hardened before.
+  if (RegsAlreadyMasked[Reg])
+    return false;
+
+  const bool Is64Bit = AArch64::GPR64allRegClass.contains(Reg);
+  LLVM_DEBUG(dbgs() << "About to harden register : " << Reg << "\n");
+  BuildMI(MBB, MBBI, MI.getDebugLoc(),
+          TII->get(Is64Bit ? AArch64::SpeculationSafeValueX
+                           : AArch64::SpeculationSafeValueW))
+      .addDef(Reg)
+      .addUse(Reg);
+  RegsAlreadyMasked.set(Reg);
+  return true;
+}
+
+bool AArch64SpeculationHardening::slhLoads(MachineBasicBlock &MBB) {
+  bool Modified = false;
+
+  LLVM_DEBUG(dbgs() << "slhLoads running on MBB: " << MBB);
+
+  RegsAlreadyMasked.reset();
+
+  MachineBasicBlock::iterator MBBI = MBB.begin(), E = MBB.end();
+  MachineBasicBlock::iterator NextMBBI;
+  for (; MBBI != E; MBBI = NextMBBI) {
+    MachineInstr &MI = *MBBI;
+    NextMBBI = std::next(MBBI);
+    // Only harden loaded values or addresses used in loads.
+    if (!MI.mayLoad())
+      continue;
+
+    LLVM_DEBUG(dbgs() << "About to harden: " << MI);
+
+    // For general purpose register loads, harden the registers loaded into.
+    // For other loads, harden the address loaded from.
+    // Masking the loaded value is expected to result in less performance
+    // overhead, as the load can still execute speculatively in comparison to
+    // when the address loaded from gets masked. However, masking is only
+    // easy to do efficiently on GPR registers, so for loads into non-GPR
+    // registers (e.g. floating point loads), mask the address loaded from.
+    bool AllDefsAreGPR = llvm::all_of(MI.defs(), [&](MachineOperand &Op) {
+      return Op.isReg() && (AArch64::GPR32allRegClass.contains(Op.getReg()) ||
+                            AArch64::GPR64allRegClass.contains(Op.getReg()));
+    });
+    // FIXME: it might be a worthwhile optimization to not mask loaded
+    // values if all the registers involved in address calculation are already
+    // hardened, leading to this load not able to execute on a miss-speculated
+    // path.
+    bool HardenLoadedData = AllDefsAreGPR;
+    bool HardenAddressLoadedFrom = !HardenLoadedData;
+
+    // First remove registers from AlreadyMaskedRegisters if their value is
+    // updated by this instruction - it makes them contain a new value that is
+    // not guaranteed to already have been masked.
+    for (MachineOperand Op : MI.defs())
+      for (MCRegAliasIterator AI(Op.getReg(), TRI, true); AI.isValid(); ++AI)
+        RegsAlreadyMasked.reset(*AI);
+
+    // FIXME: loads from the stack with an immediate offset from the stack
+    // pointer probably shouldn't be hardened, which could result in a
+    // significant optimization. See section "Don’t check loads from
+    // compile-time constant stack offsets", in
+    // https://llvm.org/docs/SpeculativeLoadHardening.html
+
+    if (HardenLoadedData)
+      for (auto Def : MI.defs())
+        // FIXME: For pre/post-increment addressing modes, the base register
+        // used in address calculation is also defined by this instruction.
+        // It might be a worthwhile optimization to not harden that
+        // base register increment/decrement when the increment/decrement is
+        // an immediate.
+        Modified |= makeGPRSpeculationSafe(MBB, NextMBBI, MI, Def.getReg());
+
+    if (HardenAddressLoadedFrom)
+      for (auto Use : MI.uses())
+        if (Use.isReg())
+          Modified |= makeGPRSpeculationSafe(MBB, MBBI, MI, Use.getReg());
+  }
+  return Modified;
+}

+/// \brief If MBBI references a pseudo instruction that should be expanded
+/// here, do the expansion and return true. Otherwise return false.
+bool AArch64SpeculationHardening::expandSpeculationSafeValue(
+    MachineBasicBlock &MBB, MachineBasicBlock::iterator MBBI) {
+  MachineInstr &MI = *MBBI;
+  unsigned Opcode = MI.getOpcode();
+  bool Is64Bit = true;
+
+  switch (Opcode) {
+  default:
+    break;
+  case AArch64::SpeculationSafeValueW:
+    Is64Bit = false;
+    LLVM_FALLTHROUGH;
+  case AArch64::SpeculationSafeValueX:
+    // Just remove the SpeculationSafe pseudo's if control flow
+    // miss-speculation
+    // isn't happening because we're already inserting barriers to guarantee
+    // that.
+    if (!UseControlFlowSpeculationBarrier) {
+      unsigned DstReg = MI.getOperand(0).getReg();
+      unsigned SrcReg = MI.getOperand(1).getReg();
+      // Mark this register and all its aliasing registers as needing to be
+      // value speculation hardened before its next use, by using a CSDB
+      // barrier instruction.
+      for (MachineOperand Op : MI.defs())
+        for (MCRegAliasIterator AI(Op.getReg(), TRI, true); AI.isValid(); ++AI)
+          RegsNeedingCSDBBeforeUse.set(*AI);
+
+      // Mask off with taint state.
+      BuildMI(MBB, MBBI, MI.getDebugLoc(),
+              Is64Bit ? TII->get(AArch64::ANDXrs) : TII->get(AArch64::ANDWrs))
+          .addDef(DstReg)
+          .addUse(SrcReg, RegState::Kill)
+          .addUse(Is64Bit ? MisspeculatingTaintReg
+                          : MisspeculatingTaintReg32Bit)
+          .addImm(0);
+    }
+    MI.eraseFromParent();
+    return true;
+  }
+  return false;
+}
+
+bool AArch64SpeculationHardening::insertCSDB(MachineBasicBlock &MBB,
+                                             MachineBasicBlock::iterator MBBI,
+                                             DebugLoc DL) {
+  assert(!UseControlFlowSpeculationBarrier && "No need to insert CSDBs when "
+                                              "control flow miss-speculation "
+                                              "is already blocked");
+  // insert data value speculation barrier (CSDB)
+  BuildMI(MBB, MBBI, DL, TII->get(AArch64::HINT)).addImm(0x14);
+  RegsNeedingCSDBBeforeUse.reset();
+  return true;
+}

+bool AArch64SpeculationHardening::lowerSpeculationSafeValuePseudos(
+    MachineBasicBlock &MBB) {
+  bool Modified = false;
+
+  RegsNeedingCSDBBeforeUse.reset();
+
+  // The following loop iterates over all instructions in the basic block,
+  // and performs 2 operations:
+  // 1. Insert a CSDB at this location if needed.
+  // 2. Expand the SpeculationSafeValuePseudo if the current instruction is one.
+  //
+  // The insertion of the CSDB is done as late as possible (i.e. just before
+  // the use of a masked register), in the hope that that will reduce the
+  // total number of CSDBs in a block when there are multiple masked registers
+  // in the block.
+  MachineBasicBlock::iterator MBBI = MBB.begin(), E = MBB.end();
+  DebugLoc DL;
+  while (MBBI != E) {
+    MachineInstr &MI = *MBBI;
+    DL = MI.getDebugLoc();
+    MachineBasicBlock::iterator NMBBI = std::next(MBBI);
+
+    // First check if a CSDB needs to be inserted due to earlier registers
+    // that were masked and that are used by the next instruction.
+    // Also emit the barrier on any potential control flow changes.
+    bool NeedToEmitBarrier = false;
+    if (RegsNeedingCSDBBeforeUse.any() && (MI.isCall() || MI.isTerminator()))
+      NeedToEmitBarrier = true;
+    if (!NeedToEmitBarrier)
+      for (MachineOperand Op : MI.uses())
+        if (Op.isReg() && RegsNeedingCSDBBeforeUse[Op.getReg()]) {
+          NeedToEmitBarrier = true;
+          break;
+        }
+
+    if (NeedToEmitBarrier)
+      Modified |= insertCSDB(MBB, MBBI, DL);
+
+    Modified |= expandSpeculationSafeValue(MBB, MBBI);
+
+    MBBI = NMBBI;
+  }
+
+  if (RegsNeedingCSDBBeforeUse.any())
+    Modified |= insertCSDB(MBB, MBBI, DL);
+
+  return Modified;
+}

bool AArch64SpeculationHardening::runOnMachineFunction(MachineFunction &MF) {		bool AArch64SpeculationHardening::runOnMachineFunction(MachineFunction &MF) {
if (!MF.getFunction().hasFnAttribute(Attribute::SpeculativeLoadHardening))		if (!MF.getFunction().hasFnAttribute(Attribute::SpeculativeLoadHardening))
return false;		return false;

MisspeculatingTaintReg = AArch64::X16;		MisspeculatingTaintReg = AArch64::X16;
		MisspeculatingTaintReg32Bit = AArch64::W16;
TII = MF.getSubtarget().getInstrInfo();		TII = MF.getSubtarget().getInstrInfo();
TRI = MF.getSubtarget().getRegisterInfo();		TRI = MF.getSubtarget().getRegisterInfo();
		RegsNeedingCSDBBeforeUse.resize(TRI->getNumRegs());
		RegsAlreadyMasked.resize(TRI->getNumRegs());
		UseControlFlowSpeculationBarrier = functionUsesHardeningRegister(MF);

bool Modified = false;		bool Modified = false;

UseControlFlowSpeculationBarrier = functionUsesHardeningRegister(MF);		// Step 1: Enable automatic insertion of SpeculationSafeValue.
		if (HardenLoads) {
		LLVM_DEBUG(
		dbgs() << "***** AArch64SpeculationHardening - automatic insertion of "
		"SpeculationSafeValue intrinsics *****\n");
		for (auto &MBB : MF)
		Modified |= slhLoads(MBB);
		}
		olista01: Why do you need to insert pseudo-instructions here, only to replace them with ANDs later on? Could this loop be moved to after the control-flow tracking has been inserted, and create the ANDs directly?
		kristof.beyls: This design inserting pseudo-instructions is based on an earlier design I did (which did not get committed) to implement the intrinsics-based approach that is also implemented in gcc. By keeping this design, it will be easier in the future to also support the intrinsics-based approach (as documented from a user point of view at https://lwn.net/Articles/759423/). The idea is that the user-specified intrinsics will be lowered to the pseudo-instruction, and that SLH basically inserts the pseudo-instructions. These pseudo-instructions then get lowered in the same way, whether they came from a user-written intrinsic or were inserted automatically by SLH. Granted, maybe the only non-trivial part of lowering the pseudo-instruction is the algorithm to reduce the number of CSDB instructions that are inserted. Obviously, we could avoid the pseudo-instructions for now and only introduce them if we introduce the intrinsics-based approach too. However, I'm not sure how much that would complicate the optimization reducing the number of CSDBs inserted. Let me look into how easy or complicated the alternative design without pseudo-instructions would be.
		kristof.beyls: I've now looked into a design where the pass iterates over each basic block just once, inserting masking operations and csdb instructions in a single go. It turns out that the logic for the optimized csdb insertion becomes much more fiddly to implement correctly (I'm not sure I managed to implement it correctly when trying, even after a few iterations of fixing failing regression tests). An extra complexity with the insertion of csdb instructions is that there isn't an easy way to test that it has been done correctly. When csdb instructions aren't inserted in the places they should be, programs will still execute and produce the expected results. For that reason, I'd prefer to keep the design where we iterate over each basic block twice: once to decide which registers must be masked at which program locations, and once to insert CSDBs. That way, we can implement the CSDB insertion separately from the other transformations in this pass, and it is less likely there will be bugs in that implementation. As to the use of pseudo-instructions: we could avoid them and encode the same information in side data structures in the pass, to convey information between the multiple iterations across a basic block. But that again might result in a more complex implementation. Furthermore, when we implement support for user-facing intrinsics, we'll still need those pseudos anyway. Therefore, I think the current design is overall a better trade-off.
		olista01: My concern was that the two-phase implementation was more complex than doing it in one phase, but if it simplifies the CSDB generation then it makes sense. I agree that using pseudo-instructions is better than storing the same information in a side data structure, and as you say we'll likely need them anyway to implement user-facing intrinsics.
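To make the trade-off discussed above concrete, here is a minimal, self-contained sketch of the second phase: a single walk over a block that lowers "mask this register" pseudos to ANDs and emits one csdb lazily, just before the first use of any masked register or before a call/terminator. It is a toy model and not the code in this patch: `ToyInst`, `lowerToyPseudos`, `NumRegs` and the string-based instruction representation are all hypothetical names chosen for illustration, and the real pass also deals with register aliases and already-masked registers.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: none of these types are LLVM classes.
constexpr std::size_t NumRegs = 32;

struct ToyInst {
  std::string Text;           // Printable form; ignored for mask pseudos.
  std::vector<unsigned> Uses; // Registers read by this instruction.
  int MaskReg;                // >= 0: pseudo that masks xN; -1: ordinary inst.
  bool IsCallOrTerminator;    // Potential control-flow change.
};

// Second phase: lower mask pseudos to ANDs with the taint register (x16)
// and insert as few csdb barriers as possible.
std::vector<std::string> lowerToyPseudos(const std::vector<ToyInst> &Block) {
  std::bitset<NumRegs> NeedCSDB; // Masked registers with no csdb after them yet.
  std::vector<std::string> Out;

  for (const ToyInst &I : Block) {
    // A barrier is needed before any control-flow change while masks are
    // pending, or before the first use of a pending masked register.
    bool Barrier = NeedCSDB.any() && I.IsCallOrTerminator;
    for (unsigned R : I.Uses) {
      assert(R < NumRegs && "register number out of range");
      Barrier = Barrier || NeedCSDB[R];
    }
    if (Barrier) {
      Out.push_back("csdb");
      NeedCSDB.reset(); // One csdb covers every mask emitted so far.
    }

    if (I.MaskReg >= 0) {
      assert(static_cast<std::size_t>(I.MaskReg) < NumRegs);
      const std::string R = "x" + std::to_string(I.MaskReg);
      Out.push_back("and " + R + ", " + R + ", x16");
      NeedCSDB.set(static_cast<std::size_t>(I.MaskReg));
    } else {
      Out.push_back(I.Text);
    }
  }

  // Masks still pending at the end of the block get a final barrier.
  if (NeedCSDB.any())
    Out.push_back("csdb");
  return Out;
}
```

Fed a block modelled on the `ldp_single_csdb` test (a load, two mask pseudos, a use of one masked register, then a return), this sketch emits both ANDs followed by a single csdb, which is the property the separate CSDB-insertion walk is meant to make easy to get right.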

// Instrument control flow speculation tracking, if requested.		// 2.a Add instrumentation code to function entry and exits.
LLVM_DEBUG(		LLVM_DEBUG(
dbgs()		dbgs()
<< "*** AArch64SpeculationHardening - track control flow ***\n");		<< "*** AArch64SpeculationHardening - track control flow ***\n");

// 1. Add instrumentation code to function entry and exits.
SmallVector<MachineBasicBlock *, 2> EntryBlocks;		SmallVector<MachineBasicBlock *, 2> EntryBlocks;
EntryBlocks.push_back(&MF.front());		EntryBlocks.push_back(&MF.front());
for (const LandingPadInfo &LPI : MF.getLandingPads())		for (const LandingPadInfo &LPI : MF.getLandingPads())
EntryBlocks.push_back(LPI.LandingPadBlock);		EntryBlocks.push_back(LPI.LandingPadBlock);
for (auto Entry : EntryBlocks)		for (auto Entry : EntryBlocks)
insertSPToRegTaintPropagation(		insertSPToRegTaintPropagation(
Entry, Entry->SkipPHIsLabelsAndDebug(Entry->begin()));		Entry, Entry->SkipPHIsLabelsAndDebug(Entry->begin()));

// 2. Add instrumentation code to every basic block.		// 2.b Add instrumentation code to every basic block.
for (auto &MBB : MF)		for (auto &MBB : MF)
Modified |= instrumentControlFlow(MBB);		Modified |= instrumentControlFlow(MBB);

		LLVM_DEBUG(dbgs() << "***** AArch64SpeculationHardening - Lowering "
		"SpeculationSafeValue Pseudos *****\n");
		// Step 3: Lower SpeculationSafeValue pseudo instructions.
		for (auto &MBB : MF)
		Modified |= lowerSpeculationSafeValuePseudos(MBB);

return Modified;		return Modified;
}		}

/// \brief Returns an instance of the pseudo instruction expansion pass.		/// \brief Returns an instance of the pseudo instruction expansion pass.
FunctionPass *llvm::createAArch64SpeculationHardeningPass() {		FunctionPass *llvm::createAArch64SpeculationHardeningPass() {
return new AArch64SpeculationHardening();		return new AArch64SpeculationHardening();
}		}

test/CodeGen/AArch64/speculation-hardening-loads.ll

This file was added.

				; RUN: llc < %s -verify-machineinstrs -mtriple=aarch64-none-linux-gnu | FileCheck %s --dump-input-on-failure

				define i128 @ldp_single_csdb(i128* %p) speculative_load_hardening {
				entry:
				%0 = load i128, i128* %p, align 16
				ret i128 %0
				; CHECK-LABEL: ldp_single_csdb
				; CHECK: ldp x8, x1, [x0]
				; CHECK-NEXT: cmp sp, #0
				; CHECK-NEXT: csetm x16, ne
				; CHECK-NEXT: and x8, x8, x16
				; CHECK-NEXT: and x1, x1, x16
				; CHECK-NEXT: csdb
				; CHECK-NEXT: mov x17, sp
				; CHECK-NEXT: and x17, x17, x16
				; CHECK-NEXT: mov x0, x8
				; CHECK-NEXT: mov sp, x17
				; CHECK-NEXT: ret
				}

				define double @ld_double(double* %p) speculative_load_hardening {
				entry:
				%0 = load double, double* %p, align 8
				ret double %0
				; Checking that the address loaded from is masked for a floating point load.
				; CHECK-LABEL: ld_double
				; CHECK: cmp sp, #0
				; CHECK-NEXT: csetm x16, ne
				; CHECK-NEXT: and x0, x0, x16
				; CHECK-NEXT: csdb
				; CHECK-NEXT: ldr d0, [x0]
				; CHECK-NEXT: mov x17, sp
				; CHECK-NEXT: and x17, x17, x16
				; CHECK-NEXT: mov sp, x17
				; CHECK-NEXT: ret
				}

				define i32 @csdb_emitted_for_subreg_use(i64* %p, i32 %b) speculative_load_hardening {
				entry:
				%X = load i64, i64* %p, align 8
				%X_trunc = trunc i64 %X to i32
				%add = add i32 %b, %X_trunc
				%iszero = icmp eq i64 %X, 0
				%ret = select i1 %iszero, i32 %b, i32 %add
				ret i32 %ret
				; Checking that a csdb is emitted before the first use of a sub-register (w8) of the masked register (x8).
				; CHECK-LABEL: csdb_emitted_for_subreg_use
				; CHECK: ldr x8, [x0]
				; CHECK-NEXT: cmp sp, #0
				; CHECK-NEXT: csetm x16, ne
				; CHECK-NEXT: and x8, x8, x16
				; csdb instruction must occur before the add instruction with w8 as operand.
				; CHECK-NEXT: csdb
				; CHECK-NEXT: mov x17, sp
				; CHECK-NEXT: add w9, w1, w8
				; CHECK-NEXT: cmp x8, #0
				; CHECK-NEXT: and x17, x17, x16
				; CHECK-NEXT: csel w0, w1, w9, eq
				; CHECK-NEXT: mov sp, x17
				; CHECK-NEXT: ret
				}

				define i64 @csdb_emitted_for_superreg_use(i32* %p, i64 %b) speculative_load_hardening {
				entry:
				%X = load i32, i32* %p, align 4
				%X_ext = zext i32 %X to i64
				%add = add i64 %b, %X_ext
				%iszero = icmp eq i32 %X, 0
				%ret = select i1 %iszero, i64 %b, i64 %add
				ret i64 %ret
				; Checking that a csdb is emitted before the first use of the super-register (x8) of the masked register (w8).
				; CHECK-LABEL: csdb_emitted_for_superreg_use
				; CHECK: ldr w8, [x0]
				; CHECK-NEXT: cmp sp, #0
				; CHECK-NEXT: csetm x16, ne
				; CHECK-NEXT: and w8, w8, w16
				; csdb instruction must occur before the add instruction with x8 as operand.
				; CHECK-NEXT: csdb
				; CHECK-NEXT: mov x17, sp
				; CHECK-NEXT: add x9, x1, x8
				; CHECK-NEXT: cmp w8, #0
				; CHECK-NEXT: and x17, x17, x16
				; CHECK-NEXT: csel x0, x1, x9, eq
				; CHECK-NEXT: mov sp, x17
				; CHECK-NEXT: ret
				}

				define i64 @no_masking_with_full_control_flow_barriers(i64 %a, i64 %b, i64* %p) speculative_load_hardening {
				; CHECK-LABEL: no_masking_with_full_control_flow_barriers
				; CHECK: dsb sy
				; CHECK: isb
				entry:
				%0 = tail call i64 asm "autia1716", "={x17},{x16},0"(i64 %b, i64 %a)
				%X = load i64, i64* %p, align 8
				%ret = add i64 %X, %0
				; CHECK-NOT: csdb
				; CHECK-NOT: and
				; CHECK: ret
				ret i64 %ret
				}