This is an archive of the discontinued LLVM Phabricator instance.

[LLD][ELF][AArch64] Complete implementation of -fix-cortex-a53-843419
ClosedPublic

Authored by peter.smith on Aug 15 2017, 8:54 AM.

Details

Summary

This patch provides the mechanism to fix instances of the instruction sequence that may trigger the cortex-a53 843419 erratum. The fix is provided by replacing one of the instructions in the erratum sequence with a branch to a replacement patch sequence. The patch sequence is just the same instruction, followed by a branch back to the instruction following the one we replaced. The act of adding the branch is sufficient to prevent the erratum.

The implementation of the patching is similar to Thunks. Once we have discovered an instance of the erratum sequence we create a patch that we will insert into a PatchSection in the same way that we insert a Thunk into a ThunkSection. Once we have a location and a symbol defined for the patch we can modify or add a relocation to the InputSection we are patching (dependent on D36745 to overwrite all bits of an instruction when doing relocation) to point to the patch.

This is patch 3 of 3 to fix pr33463 https://bugs.llvm.org/show_bug.cgi?id=33463 . It depends on D36745, D36742 and D36739

There are a number of design decisions that I've taken with the fix:

  • I've chosen to follow the branch to patch fix in the same way as gold and ld.bfd. I've left a comment in pr33463 https://bugs.llvm.org/show_bug.cgi?id=33463#c15 with the reasons why.
  • In theory Thunks and Patches can be merged or at least more tightly integrated. I've chosen not to do this as AArch64 does not currently support Thunks and no other Target needs patching, and merging the implementations would complicate both. If AArch64 needs to support range-thunks in the future some tighter integration work may be required.
  • I've chosen to follow ld.bfd and round the size of the PatchSection up to the next 0x1000 boundary, this means that if there are InputSectionDescriptions or OutputSections following an inserted PatchSection the addresses of the instructions modulo 0x1000 won't change so that we can do all the scanning and patching in one pass.

Full details of the erratum sequence can be found in http://infocenter.arm.com/help/topic/com.arm.doc.epm048406/Cortex_A53_MPCore_Software_Developers_Errata_Notice.pdf

Diff Detail

Event Timeline

peter.smith created this revision.Aug 15 2017, 8:54 AM

Rebased tests to account for removal of -print-fixes in favour of just using -verbose. No other changes.

ruiu added inline comments.Sep 5 2017, 5:09 PM
ELF/SectionPatcher.cpp
72

nit: add a blank line before comment

96

I'd move this to the .h file.

506

Flip the condition and return early.

595

I don't fully understand the new code, but it looks like it nests too deeply. Can you split the function?

ELF/SectionPatcher.h
24

Can you mention that this is ARM only?

The comment is somewhat abstract, but as we support only one feature (-fix-cortex-a53-843419) at the moment, I'd describe it directly. Say, an early version of of the processor has a bug that executes some instructions wrongly if they are near page boundaries. As a workaround, the linker replaces such instructions with branch instructions that jump to linker-generated thunks. Thunks are not at page boundaries. They executes the original instructions and then jump back to the original locations.

Thanks very much for the comments. I've made the small changes and have attempted to split up the nested function. I've removed the check for a relocation to an existing patch as this isn't necessary in a one pass implementation and it lets me simplify the code a bit.

ruiu edited edge metadata.Sep 7 2017, 6:06 PM

I have an impression that this patch can be simplified.

Does it make sense to separate Patch and PatchSection? Naively, I was thinking that patching can be done this way.

  1. Fix the output layout
  2. Scan all input sections in address-increasing order while creating and adding patch sections

In step 2, we create a synthetic patch section for each problematic instruction and add it after the current input section. When you add a synthetic patch section, you already know the distance from the problematic instruction and the synthetic patch section, so you don't need to create Relocation but can just fix instruction operands.

ELF/SectionPatcher.cpp
65

What is "Instruction 4"?

ELF/SyntheticSections.h
760

By convention, AlignToSize -> Alignment, if it means an (required) alignment.

I've updated the diff for the inline comments. With respect to the overall approach:

I have an impression that this patch can be simplified.

Does it make sense to separate Patch and PatchSection? Naively, I was thinking that patching can be done this way.

  1. Fix the output layout
  2. Scan all input sections in address-increasing order while creating and adding patch sections

In step 2, we create a synthetic patch section for each problematic instruction and add it after the current input section. When you add a synthetic patch section, you already
know the distance from the problematic instruction and the synthetic patch section, so you don't need to create Relocation but can just fix instruction operands.

The main reason to use the PatchSection is that it can be rounded up to a multiple of the page-size so that adding it doesn't affect the address calculations modulo page size of other InputSections. This allows us to get away with a single pass, and to only patch instructions that we need to patch.

With the approach outlined above the patch(es) that are added after each InputSection will affect the address calculations of all subsequent sections. To account for this we have to do one of the following:

  • Make the patch(es) that we add after an InputSection have a size that is 0 modulo page-size, this is potentially wasteful as we make each patch a minimum of 4k
  • Recalculate addresses after patching and like range-thunks, iterate until no more patches added. This may result in over-patching of InputSections which is not desirable but ok for correctness. We also have to handle the special case of skipping over a sequence we've already patch.
  • Attempt to keep a running total of the size of patches we've added in between sections. I think that this can be done for all the InputSections within an InputSectionDescription, but outside of that a linker script can make this very complicated to get right ( we can't know for sure whether the address of the next InputSectionDescription is relative to the previous one or absolute without reading the linker script).

One thing I don't quite understand is how we would directly modify the opcodes in the InputSection. My naive understanding is that at the point we do patching the contents of the InputSection are read-only, it is only when the contents has been copied to the output memory buffer that we can modify them. At this point describing the modifications in terms of relocations seemed the simplest way of modifying the instructions. Have I got this wrong?

I'm happy to try an alternative approach, although my current thinking right now would be that it wouldn't significantly simplify the solution, just a different set of trade-offs. If you have a preference for any of the alternatives let me know and I'll send out an alternative review to see if it is any better than this approach?

Rebased to account for recent changes in SyntheticSections.h, no other changes.

I've refactored the patch in an attempt to simplify it. The involves a change of strategy to make each patch its own SyntheticSection, this has the following implications:

  • We can get rid of the distinction between Patches and PatchSections, there is one instance of a Patch843419Section per erratum instance.
  • We can't round up the size of an individual Patch843419Section to avoid disturbing the page-offset of subsequent sections as this would be too wasteful in size.
  • As adding one or more Patch843419Sections may disturb the page offset of following sections, we have to do multiple passes until no more Patches are added.
  • I've converted the function into a class so we only need to create the SectionMap once.
  • Updated the tests to account for the smaller patches.

It is debatable whether this is significantly simpler than before, however it should integrate better with RangeThunks should they be implemented for AArch64.

ruiu added a comment.Sep 19 2017, 5:23 PM

I'm really sorry for slow code review. This is probably okay, but I already don't understand some lld's relocation handler code, and since these series of patches inevitably increase complexity of it, I'm worried that it could be the final shot to make it unable to understand. Relocations.cpp is in particular hard to read and contains a lot of black magics, and it is hard to modify it without breaking subtle ELF features that are implemented in an obscure way. It used to be much easier to read. We should've refactored it earlier. I believe we should rewrite it (or refactor it from the ground up) sooner or later. The problem is I was thinking is, whether this patch should be submitted over the code of that quality or not. I'm sorry that I don't have an answer to that question, but I'll take a look at them and the lld's code itself and think again.

Thank you very much for the comment, it is helpful to know what your constraints are. AArch64 and especially Arm are somewhat awkward targets as the ABI has been designed to move some of the toolchain complexity into the linker in order to simplify the compiler. Apologies in advance for the long response which I hope makes sense and is of some use. While I obviously would prefer to get these in earlier I will be at the US developers meeting next Month and will be happy to discuss.

I've written down what I think the requirements Thunks and Errata patching have with respect to relocations, I think that the majority of the code wouldn't need changing if relocations were heavily refactored. I do understand that they have their own complexity as well, a lot of it down to needing to run very late in the link due to needing address assignment.

An abstract thunk implementation:

  • It must be possible to iterate over all the relocations in the program and identify the subset of branch relocations that need a Thunk.
  • It must be possible to create a linker generated code-sequence (Thunk) and place it within range of the branch instruction.
  • It must be possible to resolve the branch to the Thunk instead of the original Target (redirect the relocation).
  • It must be possible for the code-sequence generated in the Thunk to transfer control to the original target of the relocation.
  • It must be possible to generate a Thunk to a PLT entry.
  • It is desirable to be able to reuse as many Thunks as possible.
  • It is desirable guarantee to guarantee that all branches can reach their target (alternative is relocation out of range error message)

The existing Thunk implementation (and range thunks proposal)

  • We iterate over all the thunks in the program and can read their Type to know that we may need to create a Thunk.
  • We create a SyntheticSection for the Thunks, which allows us to create a Symbol that can be used as a target for relocations.
  • We change the target of the relocation to the Symbol in the Thunk.
  • We use the relocation code in the SyntheticSection to write the transfer of control to the original Target.

The alternative to changing the target of the Thunk is to modify RelExpr, such as what is done for PLT generation (when the linker sees R_PLT_PC, it knows that it needs to resolve the relocation not to the target symbol, but to the PLT entry for that symbol). Altering RelExpr does not require a new Symbol to be created as the relocation Target isn't changed. However it does mean that the relocation code and other areas need to distinguish between R_PC and R_PLT_PC. There are trade-offs for each approach, for example if Thunks were to change the RelExpr and indirect via some Thunk target then we would have to handle R_THUNK_PC and R_THUNK_PLT_PC. Redirecting the relocation Targets means that calls to Thunks can be resolved like any other branch relocation.

The existing implementation does not really need to be in Relocations.cpp, it could be moved out to Thunks.cpp if the one function that it needs fromPlt() is made global. In summary I think that as long as we can generate Symbols in SyntheticSections, change the target symbol of relocations, then pretty much any redesign of relocations will be fine. It is helpful to reuse the existing relocation code to write the final transfer of control instructions, but there isn't anything stopping that code from being duplicated.

Abstract errata patching requirements (for the cortex-a53-843419 erratum):

  • It must be possible to disassemble the code to look for the erratum sequence.
  • It must be possible to generate a patch sequence that can be placed within branch range of the erratum sequence.
  • It must be possible to replace an instruction in the erratum sequence with a branch to a patch that contains the instruction.
  • It must be possible to transfer any relocation on the instruction we replace with a branch to the patch.
  • It must be possible to return from the patch to the next instruction following the patch.
  • It is desirable for both efficiency of the linker, and to minimize the number of patches, to have precise address information available so that only erratum sequences with page-offset of the ADRP instruction of 0xff8 or 0xffc are considered.

The current implementation uses relocations:

  • We (ab)use the regular encoding of the AArch64 branch instruction to use a R_AARCH_JUMP26 relocation to replace an instruction with a branch. We may need to overwrite an existing relocation at the location or create a new one.
  • We account for an existing relocation at the location we are patching by copying it to the patch.
  • We reuse the existing relocation code to write the return address and resolve any relocation we have copied to the patch.

I've used relocations to do the dirty work of modifying the instructions mostly for convenience. I wanted to use the existing ELF building blocks that a compiler might use to construct the patches. It could be possible to add a patch list to InputSection and hard-code the changes without relocation but I think that this would just be duplicating functionality.

Both of the implementations try to follow the principle of adding SyntheticSections with symbols and retargeting relocations at those Symbols to keep them as an isolated pass. I think that they probably wouldn't get in the way of a major refactor.

Rebased to account for recent refactorings. No other changes.

Rebase after changes made in D36742.

Updated to account for rebasing of D36742

Rebased on top of changes to D36742, no other changes.

Rebased after changes to D36749.

  • We no longer initialize the mapping symbols in the constructor as we may be constructed but not used.
  • If a patch is inserted we must update addresses again for thunks.

Still todo:

  • The implementation assumes that no OutputSection is > 128 Mb. I've encountered at least one real world program (gitit) that has a 180 Mb .text section, so I'll need to fix that particular fixme.

Updated patch to remove the assumption that a single .text section is less than 128Mb, and added test case to cover. The change just inserts patches at roughly branch range intervals rather than always putting them at the end.

Comments from Rafael

These comment updates are fine. Please commit them first.

Ok, committed in r320372

+class SectionPatcher {

Maybe call it AArch843419Patcher?

I've gone for AArch64ErrPatcher.

+ Apply any relocation transferred from the original PatcheeSection.
+
For a SyntheticSection Buf already has OutSecOff added, but relocateAlloc
+ // also adds OutSecOff so we need to subtract to avoid double counting.
+ this->relocateAlloc(Buf - OutSecOff, Buf - OutSecOff + getSize());

I wonder if we could read the already patched output buffer and avoid
this? Can we guarantee that the patch is always written after the patchee?

The patch will always be later in the list of InputSections than the patchee, but if the writing of InputSections is multithreaded then I don't think we could. In any case with the current implementation we use a branch relocation on the location of the original patchee instruction to mutate it into a branch. I think that if we were to go down that route we'd need to do something like Gold, it does 2 passes over the relocations, once for non-patch sections and once for patch sections, with the patch section pass responsible for mutating the patchee instruction into a branch after copying it into the patch.

+ std::merge(ISD.Sections.begin(), ISD.Sections.end(), Patches.begin(),
+ Patches.end(), std::back_inserter(Tmp), MergeCmp);

After this the patch sections still have a OutSecOff of the limit of
where they can be placed, no? Is that OK?

Yes as we only merge into the InputSectionDescription->Sections once using OutSecOff in the comparison function for the merge. At the end of the pass assignAddresses() will give each patch the correct OutSecOff.

+void SectionPatcher::patchInputSectionDescription(
+ InputSectionDescription &ISD, std::vector<Patch843419Section *> &Patches) {

return the std::vector.

Ok, done.

This revision was not accepted when it landed; it landed in state Needs Review.Dec 15 2017, 2:33 AM
This revision was automatically updated to reflect the committed changes.