Page MenuHomePhabricator

Please use GitHub pull requests for new patches. Avoid migrating existing patches. Phabricator shutdown timeline

[AArch64][PAC] Check authenticated LR value during tail call

Authored by atrosinenko on Jul 31 2023, 9:48 AM.



When performing a tail call, check the value of LR register after
authentication to prevent the callee from signing and spilling an
untrusted value. This commit implements a few variants of check,
more can be added later.

If it is safe to assume that executable pages are always readable,
LR can be checked just by dereferencing the LR value via LDR.

As an alternative, LR can be checked as follows:

  ; lowered AUT* instruction
  ; <some variant of check that LR contains a valid address>
  b.cond break_block
  ; lowered TCRETURN
  brk 0xc471

As the existing methods either break the compatibility with execute-only
memory mappings or can degrade the performance, they are disabled by
default and can be explicitly enabled with a command line option.

Individual subtargets can opt-in to use one of the available methods
by updating AArch64FrameLowering::getAuthenticatedLRCheckMethod().

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Herald added a project: Restricted Project. · View Herald TranscriptJul 31 2023, 9:48 AM
atrosinenko published this revision for review.Jul 31 2023, 9:54 AM

This patch is inspired by the commit (and uses the same immediate operand for BRK instruction) but adds the checks during epilogue insertion instead of asm printing.

Herald added a project: Restricted Project. · View Herald TranscriptJul 31 2023, 9:54 AM
kristof.beyls added inline comments.Aug 4 2023, 1:25 AM

My understanding is that using a load instruction to check the LR can only be done when code segments are not "execute-only".

The commit comment on suggests that at least LLD assumes that code generated for AArch64 targets is always compatible with execute-only segments. Therefore, I think that defaulting to checking-by-loading-lr is probably the wrong thing to do. It seems to me it would be better to default the other way and only allow checking-by-loading-lr when there is a guarantee that the target environment will not enforce "execute-only" code.


This is just nitpicking:

Would it be useful to have an isTailCallReturn function somewhere, and insert an


here? Given how a range of tail call pseudo opcodes have been added recently, it might be likely that a few more could get added in the future, in which case this switch statement needs to be adapted.
I'm just always very cautious when doing a switch on a set of specific opcodes as opcodes tend to evolve over time and such switch statements might be hard to maintain correctly. That's why I tend to prefer having an assert in the default statement that hopefully catches when that happens.

FWIW, already has a computation of an "IsTailCall" in this file (albeit that it doesn't consider TCRETURNriALL to be a tail call - is that an indication of an instance of the issue I described above?)).


I think this comment would be easier to read if instead of just saying "Turn it into:", it would also clearly indicate the intended effect. For example:

"To avoid generating a signing oracle, generate a code sequence that explicitly checks if the authentication passed or failed, as follows."

With just having written those extra few words now, I think that this extra checking may not be needed if the target core implements FEAT_FPAC (and maybe FEAT_FPACCOMBINE?). If so, maybe a FIXME here would be good to not generate the extra checking code sequences when FEAT_FPAC is implemented by the targeted core?


nitpick: I think I'd prefer [[AUTIASP]] rather than [[AUT]] as the macro name as that makes it clearer on first read exactly which authenticate operation is expected here. I do like the use of the macro though to hide away the difference between hint #29 and autiasp.


nitpick: similarly, I think I'd prefer [[XPACLRI]]

Address the review comments.

atrosinenko added a subscriber: olista01.EditedAug 4 2023, 8:42 AM

Updated the patch.


Updated. Though, I wonder if using the long snippet by default can introduce noticeable performance regression or not.

Meanwhile, is it permitted to remove a dead load in the machine pipeline (and should I mark it as volatile somehow).


Added an AArch64InstrInfo::isTailCallReturnInst(MachineInstr &) function and redirected there the existing code in getArgumentStackToRestore(). As far as I understand, the new TCRETURNriALL is only used by machine outliner and it seems to be used interchangeably with the existing TCRETURNdi instruction in

cc: @olista01



There exist separate review items D156784 and D156785 on taking FEAT_FPAC into account. I wonder if the particular CPU models should mention +fpac as supported or should it be only requested explicitly by the user - unlike many other subtarget features, FPAC doesn't make executable code explicitly fail on an unsupported CPU but silently makes it a bit less secure. So, the case "I have CPU X that implements all the instructions that are supported by Y (but not FEAT_FPAC)" may technically be unnoticed.





kristof.beyls added inline comments.Aug 8 2023, 6:09 AM

Yeah, I would expect the long snippet could result in a noticeable performance regression.
As discussed in last Monday's Pointer Authentication sync-up call, most current AArch64-based systems do not enforce execute-only. But that will probably change at some point in the future.
Maybe it would be best to make the code generation dependent on whether the target platform enforces execute-only? That information would probably need to be stored in some kind of TargetInformation class - I'm not sure if that information is currently stored anywhere.

@ab in that meeting also said that there was a 3rd option - checking the values of bits (I think, I don't fully remember) 55 and 56. Would that be a good option to implement/use by default?

It seems to me that at least in principle, later optimization passes are allowed to remove load instructions that they can prove have no side-effects. It may be prudent indeed to mark the load as volatile - or use any other mechanism to indicate that the load does have a side-effect and shouldn't be removed by later optimizations.


Good question!
I don't have a clear answer, I'm afraid.
I'd be interested to hear other people's opinions on this.

atrosinenko marked 2 inline comments as not done.Aug 10 2023, 7:12 AM
atrosinenko added inline comments.

Considering other possible options, IIUC something like this fragment is assumed. I glanced through Optimization Guides for a few CPU cores implementing FEAT_PAuth and it looks like XPAC* instructions are usually faster than other PAuth-related instructions (say, latency is 2, throughput is 1 compared to 5/1 for PAC* and AUT*) - this is somewhat expected as XPAC just clears a range of bits. On the other hand, EOR is still much more efficient.

Thus, it is probably worth implementing yet another option, but I doubt it should be the default because it relies on the particular TBI setting in effect while XPAC "just works" taking into account current settings, as far as I understand.


I think, in D156784 I could just add a TODO at now (to make hasFPAC available for usage) and FeatureFPAC may be added to relevant CPU cores by a later patch, if needed.

Rebased onto current main branch, will upload a few fixes shortly.


Reworked the patch

  • replaced "use a fast checker or not" boolean argument with multiple choices
  • created a separate source file and put checkAuthenticatedRegister() function there as it is likely to be later used by other parts of the codegen
  • exposed a tail-call-specific AArch64FrameLowering::checkAuthenticatedLR() similarly to signLR() and authenticateLR() so it can be used by outliner callbacks later
  • marked a dummy load instruction as volatile similar to Hexagon::PS_crash from HexagonInstrInfo.cpp
  • added the third implementation of authenticated address checker, similar to this code. Of course, more checkers can be implemented, but I think it would be better to leave this patch as a common implementation + a few "proofs of concept" and add more checkers via later patches, if needed.

No more changes remain planned on this patch, so I expect it to be ready for further review.


I reverted this option to using a fast checker by default. As far as I understand, there are no explicit function telling whether this particular subtarget expects execute-only-compatible code as it is expected to "just work" on AArch64. On the other hand, IIUC the support for execute-only mappings was recently dropped at least on Linux because of interference with Privilege Access Never feature.

At now, I just marked this option with a TODO because I expect DummyLoad to be much faster and even if someone want to use it in an execute-only environment, the issue is at least unlikely to be unnoticed.

kristof.beyls added inline comments.Aug 18 2023, 1:12 AM

I've done some further investigations and it seems that execute-only enforcement is most likely to get enabled in the not-too-distant future for some very popular AArch64 platforms. Therefore, having something that breaks the execute-only property enabled by default across all AArch64 platforms looks like a no-go. This would break most programs on those platforms.

It seems that this patch also changes code generation for pac-ret (i.e. where only return addresses are protected by pointer authentication)? There are a number of AArch64 platforms that already ship with pac-ret.
Enabling this for pac-ret moves the performance vs code-size vs security hardening trade-off for those platforms for their default code generation. Therefore, it seems to me that a lot of benchmarking would be needed to measure the code size and performance impact of this change before landing it for pac-ret.

With the above 2 observations in mind, I think that:

  • using the AuthCheckMethod::DummyLoad cannot be the default across all AArch64 platforms, as it will break most code on AArch64 platforms that plan to enable execute-only; and there is a strong indication that execute-only will be enforced on some of the most popular platforms.
  • Protecting against this authentication oracle for pac-ret code generation could only be done by doing substantial amounts of benchmarking to help make a decision on whether this is a worthwhile performance vs code-size vs security hardening trade-off.

I'd recommend to:
(a) not change code generation for pac-ret at all in this patch.
(b) change the default for AuthenticatedLRCheckMethod to something that does not break execute-only.

Ultimately, it seems to me that each platform will have to decide where its default should be in the performance vs code size vs hardening trade-off. With very few software platforms currently choosing to pay the cost for fine-grain control-flow integrity, it seems to me that a default of not hardening against this authentication oracle may be the least bad option available.

  • Updated the patch based on the changes from D159357
  • Made the check opt-in for subtargets, "none" by default, with command line option to override
  • Fixed use after free in createCheckMemOperand function (the same way it is done in Hexagon)
atrosinenko edited the summary of this revision. (Show Details)Sep 19 2023, 3:41 AM

Updated after the D159357 was changed.

Rebased and updated the patch a bit more:

  • in getOutliningCandidateInfo, adjusted SequenceSize variable before its first use
  • updated code comments
  • skipped checking LR if Scadow Call Stack is enabled: the LR value just before TCRETURN* instruction is anyway not the one produced by AUT* (and we cannot check right after AUT* because we cannot be sure which register is usable as a temporary)


Here is a summary of the contents of this patch:

  • implemented a standalone llvm::AArch64PAuth::checkAuthenticatedRegister utility function to emit one of a number of checks in case a pointer is AUT'ed and not immediately used for memory access
    • placed this function into a sub-namespace instead of making it a static class member, so I don't have to put otherwise irrelevant AArch64PointerAuth class definition to header file
    • note that the checks that are inserted by checkAuthenticatedRegister function are not specific to tail calls (but some of the checks may have restrictions - such as requiring AuthenticatedReg == LR because XPACLRI is encoded as HINT while generic XPAC* instructions require FEAT_PAUTH)
  • hooked it to AArch64PointerAuth class via checkAuthenticatedLR method dedicated to hardening tail calls
  • in machine outliner, update the costs computed by AArch64InstrInfo::getOutliningCandidateInfo method on a best-effort basis:
    • in MachineOutlinerTailCall outlining mode, we need to insert checks in each caller of OUTLINED_FUNCTION
    • in MachineOutlinerThunk mode, at most a single extra check is inserted in OUTLINED_FUNCTION itself
    • other modes do not introduce new tail calls, so let's just try to account for the checks that would possibly be inserted later into the original candidates by the AArch64PointerAuth pass (including in the two aboves modes)
  • factored out isTailCallReturnInst and needsShadowCallStackPrologueEpilogue utility functions

Thank you for the update!
I haven't looked at Arch64PointerAuth.cpp and the regression test in detail yet.
Most of the other parts of this patch looks fine to me, apart from the 2 nitpicks and the one probably bigger issue with getInstSizeInBytes() - see inline comments.


I had to spend quite a bit of time to understand the logic here.
I think, if possible, it would be easier to understand the logic if SequenceSize was updated much closer to where it is initially computed on line 8133.

Ideally, getInstSizeInBytes(MI) would return the correct number of bytes.

Actually, looking at the documentation of getInstSizeInBytes(MI), at, it states that getInstSizeInBytes is guaranteed to return the maximum number of bytes.
So on a tail call return that might produce lots of bytes, it should return the maximum number.

If it does not produce the maximum number, I seem to remember that that can lead to hard-to-debug compilation failures where the code size estimation for computing whether a constant island is within range could go wrong. (I might be misremembering, it could be something different than a constant island going out of range).

So, I think that getInstSizeInBytes(MI) needs to be adapted so it returns the correct maximum size in bytes for a tail call. When that is implemented, the change on this line in this patch may no longer be needed?

Assuming I remember all of the above correctly, it may be good to add a regression test that verifies that getInstSizeInBytes is calculated correctly for large tail calls. is a patch that fixed a similar issue on another instruction. The test case that was added there could serve as inspiration.


nit pick - feel free to disagree. IIUC, all methods are pre-v8.3 compatible. Is it then worthwhile to call pre-v8.3 compatibility out in the help text? I think I'd remove the "(pre-v8.3 compatible)" part.



Updated the patch, thank you.


On one hand, updating getInstSizeInBytes(MI) should fix not only the computation of SequenceSize but other possible callers of getInstSizeInBytes as well. On the other hand, inserting checker code before TCRETURN* instructions seems like just yet another code transformation (and TCRETURN* instructions are actually four bytes long after AArch64PointerAuth pass processes the function).

Moreover, in getInstSizeInBytes(MI), there is handling for several special cases and a comment stating

// FIXME: We currently only handle pseudoinstructions that don't get expanded
//        before the assembly printer.

Thus, I would rather keep updating SequenceSize ad-hoc, but move the update right after NumBytesToCheckLRInTCEpilogue is computed.


There will probably be non v8.2-compatible checkers in the future - for example, if we want to check an *arbitrary* register using XPAC. Though, I agree that it is better to drop the "(pre-v8.3 compatible)" part here: it would be better to add more comprehensible comment like "(requires FEAT_PAUTH)" to those checkers, I think.



Addressed the comments so far.

kristof.beyls added inline comments.Oct 4 2023, 2:22 AM

Thanks for pointing to that FIXME - indeed, it seems like as long as the expansion happens early enough, there should be no issue.


I think this class could use a comment.
Would something like the following be correct?

// AddressCheckPseudoSourceValue represents the memory access inserted by the
// the AuthCheckMethod::DummyLoad method to trigger a run-time error if the
// authentication of a pointer failed.

Seeing this is a static variable made me think: would it be possible in a single run of any program using the LLVM libraries (such as clang, flang, rust, opt, ...) for MF.getTarget() to potentially return a different result sometimes within the same execution of the that program.
If that could occur, then some from some executions of this function, the CheckPSV variable could be initialized with the wrong TargetMachine reference....

I cannot immediately think of an example, but I am also not sure.
I think it would be more prudent to not have this as a static variable.


I'm wondering why any/all of the machineinsts created in this function need to have the FrameDestroy flag set? Do you know?


The convention in the LLVM code base is to use "FIXME" rather than "TODO".

Address the review comments.

atrosinenko added inline comments.Oct 6 2023, 7:11 AM

Added your comment, thank you.


I used the same approach as it is implemented in Hexagon backend: HexagonInstrInfo.cpp. For now TargetMachine is only used by PseudoSourceValue constructor to call TM.getAddressSpaceForPseudoSourceKind(Kind). I agree that using static instance looks suspicious, but I guess allocating new instance from heap on each check is quite wasteful.



atrosinenko added inline comments.Oct 6 2023, 7:23 AM

Setting MachineInstr::FrameDestroy unconditionally in checkAuthenticatedRegister is definitely a mistake, thank you. Maybe I have to pass MI flags as an argument, but it seems that it works as-is, at least for generating DWARF debug info (see Epilogue Begin marker).

Added an explicit assertion that WinCFI is not requested as I don't yet emit any SEH opcodes.

$ cat /tmp/tail-call.c 
int caller_indirect(int *n, int (fptr)(int*)) {
  asm volatile ("" ::: "lr");
  *n = 42;
  return fptr(n);
$ ./bin/clang -O1 -target aarch64-linux-gnu /tmp/tail-call.c -c -o /tmp/tail-call.o -mbranch-protection=pac-ret -mllvm -aarch64-authenticated-lr-check-method=xpac-hint -g
$ dwarfdump -l /tmp/tail-call.o && llvm-objdump -d /tmp/tail-call.o 

.debug_line: line number info for a single cu
Source lines (from CU-DIE at .debug_info offset 0x0000000c):

            NS new statement, BB new basic block, ET end of text sequence
            PE prologue end, EB epilogue begin
            IS=val ISA number, DI=val discriminator value
<pc>        [lno,col] NS BB ET PE EB IS= DI= uri: "filepath"
0x00000000  [   1, 0] NS uri: "/tmp/tail-call.c"
0x0000000c  [   2, 3] NS PE
0x0000000c  [   3, 6] NS
0x00000010  [   4,10] NS ET EB
0x00000030  [   4,10] NS ET

/tmp/tail-call.o:       file format elf64-littleaarch64

Disassembly of section .text:

0000000000000000 <caller_indirect>:
       0: d503233f      paciasp
       4: f81f0ffe      str     x30, [sp, #-0x10]!
       8: 52800548      mov     w8, #0x2a
       c: b9000008      str     w8, [x0]
      10: f84107fe      ldr     x30, [sp], #0x10
      14: d50323bf      autiasp
      18: aa1e03f0      mov     x16, x30
      1c: d50320ff      xpaclri
      20: eb1e021f      cmp     x16, x30
      24: 54000041    0x2c <caller_indirect+0x2c>
      28: d61f0020      br      x1
      2c: d4388e20      brk     #0xc471
kristof.beyls added inline comments.Oct 9 2023, 1:58 AM

I guess this would be allocating new instances on the stack rather than the heap?
Anyway, I think it would be better to not rely on a static variable where we know this is likely to trigger a hard-to-track-down bug at some point in the future, for some use cases.

If the cost would be too high, I guess it might be possible to "cache" CheckPSV once per MachineFunction being processed? That may require finding an object that is constructed once per MachineFunction and put that cached CheckPSV in there. Not sure if there is such an obvious object at the moment and if so, how much refactoring would be needed to make this possible.

Anyway, I guess this doesn't get called that often (only once per tail call) so presumably it isn't worthwhile to make the code a lot more complicated for the small (would it even be measurable?) compile time gain.

Moved AddressCheckPseudoSourceValue to AArch64SubtargetInfo.

atrosinenko added inline comments.Oct 9 2023, 9:36 AM

Unfortunately, MachinePointerInfo stores a pointer to CheckPSV, so allocating it on stack is not possible.

Turned out, each MachineFunction has PseudoSourceValueManager, but it is not subclassed per-target, so caching target-specific PSV would require some refactoring. On the other hand, AddressCheckPseudoSourceValue has quite generic semantics, so moved it to AArch64SubtargetInfo.

kristof.beyls accepted this revision.Oct 10 2023, 5:13 AM

I think that with the last change, all my comments have been addressed.
Please do react if I accidentally missed a comment that needs further work.
Assuming indeed all comments have been addressed: LGTM!


Thank you, that looks like a neat solution.

This revision is now accepted and ready to land.Oct 10 2023, 5:13 AM

Added FIXME mentioning FEAT_FPAC to AArch64PointerAuth::checkAuthenticatedLR.

Thank you for the review. I read through the comments - they are already addressed except maybe for a few FPAC-related notes, so I added a FIXME to checkAuthenticatedLR function and left a note in the discussion of D156784.

I will rebase and retest this patch and land it shortly if everything works.

chill added a subscriber: chill.Oct 14 2023, 5:01 AM
chill added inline comments.

nit: The check for FEAT_FPAC perhaps can be done in getAuthenticatedLRCheckMethod (and possibly return None), so the logic of deciding whether to emit code or not is kept in one place.