
Align branches within 32-Byte boundary
Needs ReviewPublic

Authored by skan on Nov 12 2019, 6:50 PM.

Details

Summary

The microcode update for the Jump Conditional Code Erratum may cause performance
loss for some workloads:

https://www.intel.com/content/www/us/en/support/articles/000055650.html

This patch mitigates the performance impact by aligning branches
within a 32-byte boundary. The affected instructions are:

a. Conditional jump.
b. Fused conditional jump.
c. Unconditional jump.
d. Indirect jump.
e. Ret.
f. Call.

An option, -mbranches-within-32B-boundaries, is added to align branches within a
32-byte boundary and thereby reduce the potential performance loss from the
microcode update. The option is equivalent to the combination of three options:

-malign-branch-boundary=32
-malign-branch=fused+jcc+jmp
-malign-branch-prefix-size=5

For llvm-mc, -x86-branches-within-32B-boundaries is added; it enables
-x86-align-branch-boundary=32
-x86-align-branch=fused+jcc+jmp
-x86-align-branch-prefix-size=5

Finer-grained options are also added for clang:

  1. -malign-branch-boundary=NUM aligns branches within a NUM-byte boundary.
  2. -malign-branch=TYPE[+TYPE...] specifies the types of branches to align.
  3. -malign-branch-prefix-size=NUM limits the prefix padding to NUM bytes per instruction.

The corresponding options for llvm-mc are -x86-align-branch-boundary=NUM,
-x86-align-branch=TYPE[+TYPE...], and -x86-align-branch-prefix-size=NUM.
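For example, assuming a build of llvm-mc that includes this patch, the combined behavior could be requested explicitly with the three fine-grained flags listed above:

```
# Align branches within 32-byte boundaries while assembling foo.s
# (equivalent to -x86-branches-within-32B-boundaries)
llvm-mc -triple=x86_64 -filetype=obj \
    -x86-align-branch-boundary=32 \
    -x86-align-branch=fused+jcc+jmp \
    -x86-align-branch-prefix-size=5 \
    foo.s -o foo.o
```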

A new MCFragment type, MCMachineDependentFragment, is added, which has
6 subtypes:

  1. BranchPadding: a variable-size fragment that inserts NOPs before a branch.
  2. BranchPrefix: a variable-size fragment that inserts segment prefixes before an instruction. The prefix is chosen as follows: (a) use the instruction's existing segment prefix if it has one; (b) in 64-bit mode, use the CS segment prefix; (c) in 32-bit mode, use the SS segment prefix when the base register is ESP/EBP, and the DS segment prefix otherwise.
  3. FusedJccPadding: a variable-size fragment that inserts NOPs before a fused conditional jump.
  4. BranchSplit: a zero-size fragment that separates the instruction that is fused with the following conditional jump from that fused jcc.
  5. HardCodeBegin: a zero-size fragment that marks the beginning of a sequence of hard-coded bytes.
  6. HardCodeEnd: a zero-size fragment that marks the end of a sequence of hard-coded bytes.
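As a rough sketch (the enumerator names come from the summary above, but the layout is assumed, not the patch's actual code), the six subtypes and their sizing behavior could be modeled as:

```cpp
#include <cassert>

// Hypothetical sketch of the six MCMachineDependentFragment subtypes.
enum MachineDependentFragmentKind {
  BranchPadding,   // variable size: NOPs inserted before a branch
  BranchPrefix,    // variable size: segment prefixes added to an instruction
  FusedJccPadding, // variable size: NOPs before a fused conditional jump
  BranchSplit,     // zero size: separates a fused instruction from its jcc
  HardCodeBegin,   // zero size: marks the start of hard-coded bytes
  HardCodeEnd      // zero size: marks the end of hard-coded bytes
};

// Only the padding/prefix subtypes can change size during relaxation;
// the other three are zero-size markers.
bool isVariableSize(MachineDependentFragmentKind K) {
  return K == BranchPadding || K == BranchPrefix || K == FusedJccPadding;
}
```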

alignBranchesBegin and alignBranchesEnd are used to insert an
MCMachineDependentFragment before instructions, and relaxMachineDependent
grows or shrinks the prefix and NOP sizes to align the next branch fragment:

  1. First, we try to add segment prefixes to the instructions before a branch.
  2. If there is not enough room to add segment prefixes, NOPs are inserted before the branch.
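The underlying padding decision can be sketched as follows. This is a simplified, hypothetical model (the function names and exact logic are assumptions, not the patch's code): per the erratum, a jump is affected when it crosses a 32-byte boundary or ends exactly on one, so the assembler must compute the smallest number of bytes to insert before the branch to avoid that.

```cpp
#include <cassert>
#include <cstdint>

// True if an instruction at StartAddr of Size bytes crosses a boundary of
// Boundary bytes, or ends exactly on one (the cases the erratum penalizes).
bool mayCrossBoundary(uint64_t StartAddr, uint64_t Size, uint64_t Boundary) {
  uint64_t EndAddr = StartAddr + Size;
  return (StartAddr / Boundary) != ((EndAddr - 1) / Boundary) ||
         EndAddr % Boundary == 0;
}

// Smallest number of padding bytes (prefixes or NOPs) to place before the
// branch so that it no longer crosses or ends on a boundary.
uint64_t bytesToPad(uint64_t StartAddr, uint64_t Size, uint64_t Boundary) {
  uint64_t Pad = 0;
  while (mayCrossBoundary(StartAddr + Pad, Size, Boundary))
    ++Pad;
  return Pad;
}
```

For instance, a 6-byte je at byte 90 ends at byte 96, exactly on a 32-byte boundary, so 6 bytes of padding push it to start at byte 96 instead.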

Prefix and NOP padding are disabled in two cases:

  1. If the previous item is hard code, which may be used to hand-encode an instruction, since there is then no clear instruction boundary.
  2. If the instruction may be rewritten by the linker, such as a TLS call.

Diff Detail

Event Timeline

skan added a comment.Wed, Nov 20, 11:35 PM

Do not insert a NOP or prefix if the previous item is hard code, which may be
used to hand-encode an instruction, since there is no clear instruction boundary.

Thanks for the comments, they help a little. But it's still somewhat confusing, so let me write down what seems to be happening:

  • Before emitting every instruction, a new MCMachineDependentFragment is now emitted, of one of the multiple types:
    • For most instructions, that'll be BranchPrefix.
    • For things that need branch-alignment, it'll be BranchPadding, unless there's a fused conditional before, in which case it's BranchSplit
    • For fused conditionals, it'll be FusedJccPadding.
  • After emitting an instruction that needs branch-alignment, all of those previously-emitted MCMachineDependentFragment are updated to point to the branch's fragment.
  • Thus, every MCDataFragment now only contains a single instruction (this property is depended upon for getInstSize, at least).

All the MCMachineDependentFragments in a region bounded by a branch at the end and either a branch or a fragment-type which is not type in {FT_Data, FT_MachineDependent, FT_Relaxable, FT_CompactEncodedInst} at the beginning, will reference the ending branch instruction's fragment.

Then, when it comes time to do relaxation, every one of those machine-dependent-fragments has the opportunity to grow its instruction a little bit. The first instruction in a "block" will grow up to 5 segment prefixes (via modifying the BranchPrefix fragment), and then if more is needed, more prefixes will be added to the next instruction, and so on. Until you run out of instructions in the region. At which point the BranchPadding or FusedJccPadding types (right before the branch/fused-branch) will be able to emit nops to achieve the desired alignment.

An alternative would be to simply emit NOPs before branches as needed. That would be substantially simpler, since it would only require special handling for a branch or a fused-branch. I assume things were done this substantially-more-complex way in order to reduce performance cost of inserting NOP instructions? Are there numbers for how much better it is to use segment prefixes, vs a separate nop instruction? It seems a little bit surprising to me that it would be that important, but I don't know...

I'll note that the method here has the semantic issue of making it effectively impossible to ever evaluate an expression like ".if . - symbol == 24" (assuming we're emitting instructions), since every instruction can now change size. I suspect that will make it impossible to turn this on by default without breaking a lot of assembly code. Previously, only certain instructions, like branches or arithmetic ops with constant arguments of unknown value, could change size.
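The grow-prefixes-first, then-NOPs policy described above can be sketched as a greedy allocation. This is a hypothetical simplification, not the patch's actual relaxation code: given the padding a branch needs and how many extra segment prefixes each preceding instruction in the region can still accept (at most 5 per instruction), assign prefixes first and emit whatever remains as NOPs.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Returns the per-instruction prefix counts and the leftover byte count,
// which would be emitted as NOPs immediately before the branch.
std::pair<std::vector<unsigned>, unsigned>
allocatePadding(unsigned Needed, const std::vector<unsigned> &MaxPrefixes) {
  std::vector<unsigned> Prefixes(MaxPrefixes.size(), 0);
  for (std::size_t I = 0; I < MaxPrefixes.size() && Needed > 0; ++I) {
    // Greedily take as many prefixes as this instruction still allows.
    unsigned Take = std::min(Needed, MaxPrefixes[I]);
    Prefixes[I] = Take;
    Needed -= Take;
  }
  return {Prefixes, Needed}; // leftover becomes NOP padding
}
```

For example, 7 bytes of required padding across two instructions that can each take 5 prefixes yields 5 + 2 prefixes and zero NOPs; 12 bytes across capacities of 5 and 3 yields 5 + 3 prefixes and 4 NOP bytes.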

llvm/lib/MC/MCAssembler.cpp
1030

I don't think this is necessary. AFAICT, the symbols should already be in the right place -- pointing to the relax fragment, not the instruction itself, without this. And removing all this moveSymbol/updateSymbolMap code doesn't make any tests fail.

Thanks for the comments, they help a little. But it's still somewhat confusing, so let me write down what seems to be happening:

  • Before emitting every instruction, a new MCMachineDependentFragment is now emitted, of one of the multiple types:
    • For most instructions, that'll be BranchPrefix.
    • For things that need branch-alignment, it'll be BranchPadding, unless there's a fused conditional before, in which case it's BranchSplit
    • For fused conditionals, it'll be FusedJccPadding.
  • After emitting an instruction that needs branch-alignment, all of those previously-emitted MCMachineDependentFragment are updated to point to the branch's fragment.
  • Thus, every MCDataFragment now only contains a single instruction (this property is depended upon for getInstSize, at least).

    All the MCMachineDependentFragments in a region bounded by a branch at the end and either a branch or a fragment-type which is not type in {FT_Data, FT_MachineDependent, FT_Relaxable, FT_CompactEncodedInst} at the beginning, will reference the ending branch instruction's fragment.

    Then, when it comes time to do relaxation, every one of those machine-dependent-fragments has the opportunity to grow its instruction a little bit. The first instruction in a "block" will grow up to 5 segment prefixes (via modifying the BranchPrefix fragment), and then if more is needed, more prefixes will be added to the next instruction, and so on. Until you run out of instructions in the region. At which point the BranchPadding or FusedJccPadding types (right before the branch/fused-branch) will be able to emit nops to achieve the desired alignment.

    An alternative would be to simply emit NOPs before branches as needed. That would be substantially simpler, since it would only require special handling for a branch or a fused-branch. I assume things were done this substantially-more-complex way in order to reduce performance cost of inserting NOP instructions? Are there numbers for how much better it is to use segment prefixes, vs a separate nop instruction? It seems a little bit surprising to me that it would be that important, but I don't know...

I don't have any numbers myself. I was only involved in some of the code review internally. My understanding is that NOP instructions would place extra nop uops into the DSB(the decoded uop buffer) and that limits the performance that can be recovered. By using redundant prefixes no extra uops are generated and more performance is recovered.

I'll note that the method here has the semantic issue of making it effectively impossible to ever evaluate an expression like ".if . - symbol == 24" (assuming we're emitting instructions), since every instruction can now change size. I suspect that will make it impossible to turn this on by default without breaking a lot of assembly code. Previously, only certain instructions, like branches or arithmetic ops with constant arguments of unknown value, could change size.

efriedma added inline comments.
llvm/include/llvm/MC/MCFragment.h
591

Global variables are forbidden in LLVM libraries; there could be multiple LLVMContexts in the same process.

skan marked an inline comment as done.Thu, Nov 21, 8:26 PM

Thanks for the comments, they help a little. But it's still somewhat confusing, so let me write down what seems to be happening:

  • Before emitting every instruction, a new MCMachineDependentFragment is now emitted, of one of the multiple types:
    • For most instructions, that'll be BranchPrefix.
    • For things that need branch-alignment, it'll be BranchPadding, unless there's a fused conditional before, in which case it's BranchSplit
    • For fused conditionals, it'll be FusedJccPadding.
  • After emitting an instruction that needs branch-alignment, all of those previously-emitted MCMachineDependentFragment are updated to point to the branch's fragment.
  • Thus, every MCDataFragment now only contains a single instruction (this property is depended upon for getInstSize, at least).

    All the MCMachineDependentFragments in a region bounded by a branch at the end and either a branch or a fragment-type which is not type in {FT_Data, FT_MachineDependent, FT_Relaxable, FT_CompactEncodedInst} at the beginning, will reference the ending branch instruction's fragment.

    Then, when it comes time to do relaxation, every one of those machine-dependent-fragments has the opportunity to grow its instruction a little bit. The first instruction in a "block" will grow up to 5 segment prefixes (via modifying the BranchPrefix fragment), and then if more is needed, more prefixes will be added to the next instruction, and so on. Until you run out of instructions in the region. At which point the BranchPadding or FusedJccPadding types (right before the branch/fused-branch) will be able to emit nops to achieve the desired alignment.

    An alternative would be to simply emit NOPs before branches as needed. That would be substantially simpler, since it would only require special handling for a branch or a fused-branch. I assume things were done this substantially-more-complex way in order to reduce performance cost of inserting NOP instructions? Are there numbers for how much better it is to use segment prefixes, vs a separate nop instruction? It seems a little bit surprising to me that it would be that important, but I don't know...

    I'll note that the method here has the semantic issue of making it effectively impossible to ever evaluate an expression like ".if . - symbol == 24" (assuming we're emitting instructions), since every instruction can now change size. I suspect that will make it impossible to turn this on by default without breaking a lot of assembly code. Previously, only certain instructions, like branches or arithmetic ops with constant arguments of unknown value, could change size.

Thanks for your detailed and accurate analysis of my code! I am sorry; this summary should have been written by me.

llvm/lib/MC/MCAssembler.cpp
1030

Yes, I checked, and you are right. I will remove all of the moveSymbol/updateSymbolMap code.

skan updated this revision to Diff 230605.Fri, Nov 22, 1:12 AM
skan edited the summary of this revision. (Show Details)

Three changes are made:

  1. Remove the moveSymbol/updateSymbolMap code, since it is not necessary.
  2. Make the variables AlignBoundarySize and AlignMaxPrefixSize non-global.
  3. Disable NOP padding before instructions with a variant symbol operand, since they may be rewritten by the linker.
skan updated this revision to Diff 230617.EditedFri, Nov 22, 2:25 AM

Speed up the case where we only want to emit NOPs before branches as needed.

skan added a comment.Fri, Nov 22, 3:01 AM

An alternative would be to simply emit NOPs before branches as needed. That would be substantially simpler, since it would only require special handling for a branch or a fused-branch. I assume things were done this substantially-more-complex way in order to reduce performance cost of inserting NOP instructions? Are there numbers for how much better it is to use segment prefixes, vs a separate nop instruction? It seems a little bit surprising to me that it would be that important, but I don't know...

I'll note that the method here has the semantic issue of making it effectively impossible to ever evaluate an expression like ".if . - symbol == 24" (assuming we're emitting instructions), since every instruction can now change size. I suspect that will make it impossible to turn this on by default without breaking a lot of assembly code. Previously, only certain instructions, like branches or arithmetic ops with constant arguments of unknown value, could change size.

Thanks for the reminder! Now, if -malign-branch-prefix-size=0 is used, the method only inserts BranchPadding or FusedJccPadding before branches as needed, and inserts BranchPrefix before instructions that are macro-fusible but not macro-fused. In that case, the operation is as simple as inserting NOPs only. Since more performance can be recovered by inserting prefixes than by inserting NOPs, I believe I should keep supporting prefix padding.

skan edited the summary of this revision. (Show Details)Fri, Nov 22, 3:48 AM

(Just a reminder that we need to have both performance and code size numbers for this patch. And given that there are a few options, may need a few examples.)

I have some other concerns about the code itself, but after pondering this a little bit, I'd like to instead bring up some rather more general concerns about the overall approach used here -- with some suggestions. (As these comments are not really comments on the _code_, but on the overall strategy, they also apply to the gnu binutils patch for this same feature.)

Of course, echoing chandler, it would be nice to see some performance numbers, otherwise it's not clear how useful any of this is.

Segment-prefix padding

The ability to pad instructions instead of adding a multibyte-NOP in order to create alignment seems like a generally-useful feature, which should be usable in other situations where we align within a function -- assuming that there is indeed a measurable performance benefit vs NOP instructions. E.g. we should do this for basic-block alignment, as well! As such, it feels like this feature ought to be split out, and implemented separately from the new branch-alignment functionality -- in a way which is usable for any kind of alignment request.

The way I'd imagine it working is to introduce a new pair of asm directives to enable and disable segment-prefix padding over a range of instructions (let's say ".enable_instr_prefix_pad" and ".disable_instr_prefix_pad"). Within such an enabled range, the assembler would prefix instructions as required in order to satisfy later nominally-NOP-emitting code alignment directives (such as the usual '.p2align 4, 0x90').

Branch alignment

The primary goal of this patch, restricting the placement of branch instructions, is a performance optimization. Similar to loop alignment, the desire is to increase speed, at the cost of code-size. However, the way this feature has been designed is a global assembler flag. I find that not ideal, because it cannot take into account hotness of a block/function, as for example loop alignment code does. Basic-block alignment of loops is explicitly opt-in on an block-by-block basis -- the compiler simply emits a p2align directive where it needs, and the assembler honors that. And so, MachineBlockPlacement::alignBlocks has a bunch of conditions under which it will avoid emitting a p2align. This seems like a good model -- the assembler does what it's told by the compiler (or assembly-writer). Making the branch-instruction-alignment work similarly seems like it would be good.

IMO it would be nicest if there could be a directive that requests to specially-align the next instruction. However, the fused-jcc case makes that quite tricky, so perhaps this ought to also be a mode which can be enabled/disabled on a region as well.
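As a sketch, compiler output under the directive-based proposal above might look like the following (directive names taken from the suggestion; the exact semantics are illustrative only):

```
.enable_instr_prefix_pad        # assembler may pad these with segment prefixes
        movq    %rsp, %rbp
        testq   %rdi, %rdi
        je      .LBB0_2         # kept from crossing a 32-byte boundary
.disable_instr_prefix_pad       # disabled, e.g., around inline assembly
```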

Enabling by default

Previously, I'd mentioned that it seemed likely we couldn't actually enable branch-alignment by default because it'll probably break people's inline-asm and standalone asm files. That would be solved by making everything controllable within the asm file. The compiler could then insert the directives for its own code, and disable it around inline assembly. And standalone asm files could remain unaffected, unless they opt in. With that, we could actually enable the alignment by default, for compiled output in certain cpu-tuning modes, if it's warranted.

reames added a subscriber: reames.Mon, Dec 2, 9:02 AM
reames added a comment.Mon, Dec 2, 9:34 AM

I want to chime in to support jyknight's meta comments, particularly the one about the need to balance execution speed vs. code size differently in hot vs. cold code. For our use case, we have a very large amount of branch-dense, known-cold paths, and being able to align only fast-path branches would be a substantial space savings.

I also see value in having the prefix padding feature factored out generically. If that mechanism is truly measurably faster than multi-byte NOPs (which, if I am reading the comments correctly, has been claimed but not documented or measured), using it generically for other alignment purposes would likely be worthwhile.

I'd also like to see, probably in a separate patch, support for auto-detecting whether the host CPU needs this mitigation. Both -mcpu=native and various JITs will end up needing this, and having the code centralized in one place would be good.

Just FYI for now, as I'm trying to dig further...
I have been trying this fix in our downstream environment and managed to get a hang with this backtrace:

#0 needPadding (BoundarySize=<optimized out>, Size=<optimized out>, StartAddr=<optimized out>) at ./llvm/lib/MC/MCAssembler.cpp:1028
#1 llvm::MCAssembler::relaxMachineDependent (this=<optimized out>, Layout=..., MF=...) at ./llvm/lib/MC/MCAssembler.cpp:1077
#2 0x00007ffff1adc8b6 in llvm::MCAssembler::layoutSectionOnce (this=this@entry=0x7fff90580580, Layout=..., Sec=...) at ./orca/llvm/lib/MC/MCAssembler.cpp:1213
...

Working on getting upstream reproducer....

fedor.sergeev requested changes to this revision.EditedTue, Dec 3, 2:42 PM

Working on getting upstream reproducer....

Hangs on a moderately small piece (~150 lines) of x86 assembly, filed a bug on it here:

https://bugs.llvm.org/show_bug.cgi?id=44215

This revision now requires changes to proceed.Tue, Dec 3, 2:42 PM

I'm seeing lots of updates to fix bugs, but no movement for many days on both my meta comments and (in some ways more importantly) James's meta comments. (And thanks Philip for chiming in too!)

Meanwhile, we really, really need to get this functionality in place. The entire story for minimizing the new microcode performance hit hinges on these patches, and I'm really worried by how little progress we're seeing here.

craig.topper added inline comments.Tue, Dec 3, 10:23 PM
llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h
158 ↗(On Diff #232032)

None of the AND/ADD/SUB instructions ending in mr are eligible for macrofusion as far as I know. Those all involve a load and a store which is not supported by macrofusion.

We also lost all the ADD*_DB instructions from the macrofusion list. I believe they are in the existing list incorrectly. So removing them is correct, but as far as I can see that change was not mentioned in the description of this patch.

Can we split the macrofusion refactoring out of this patch so we can review it separately, and hopefully get it committed while the other review feedback is being addressed?

annita.zhang added a comment.EditedTue, Dec 3, 11:29 PM

I'm seeing lots of updates to fix bugs, but no movement for many days on both my meta comments and (in some ways more importantly) James's meta comments. (And thanks Philip for chiming in too!)

Meanwhile, we really, really need to get this functionality in place. The entire story for minimizing the new microcode performance hit hinges on these patches, and I'm really worried by how little progress we're seeing here.

Sorry for the belated response. We're working hard to get through some paperwork so that the performance data is ready. I think it may be better to open a thread on llvm-dev to post the performance data and discuss those suggestions.

The first data was posted in http://lists.llvm.org/pipermail/llvm-dev/2019-December/137413.html.

Thanks,
Annita

skan updated this revision to Diff 232045.Wed, Dec 4, 12:17 AM

Fix the macro fusion table. If an instruction's first source operand or its destination operand is memory, it cannot be macro-fused.

skan marked an inline comment as done.Wed, Dec 4, 12:20 AM
skan added inline comments.
llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h
158 ↗(On Diff #232032)

Okay, I will upload another patch to correct the macrofusion table as soon as possible.

Can you please put the macro fusion changes in a separate phabricator review. I’ll review it in the morning US time and if it all looks good we can get that part committed while the other comments are being addressed.

skan added a comment.Wed, Dec 4, 12:47 AM

Can you please put the macro fusion changes in a separate phabricator review. I’ll review it in the morning US time and if it all looks good we can get that part committed while the other comments are being addressed.

Sure.

skan updated this revision to Diff 232068.Wed, Dec 4, 3:08 AM

We shouldn't try to insert a prefix or NOP in a virtual section.

skan updated this revision to Diff 232116.Wed, Dec 4, 6:33 AM

Since hard code exists only in text sections, we should not insert HardCodeBegin/HardCodeEnd in other sections, such as virtual sections.

spatel added a subscriber: spatel.Wed, Dec 4, 6:57 AM

I am still trying to understand the patch. Just made some comments about the tests.

llvm/include/llvm/MC/MCFragment.h
570

Don’t duplicate function or class name at the beginning of the comment (BranchPadding - ). (ref: https://llvm.org/docs/CodingStandards.html#doxygen-use-in-documentation-comments)

581

Full stop.

616

Move the llvm_unreachable below the switch, otherwise clang will give a warning:

warning: default label in switch which covers all enumeration values [-Wcovered-switch-default]

Unfortunately, GCC (even GCC 9) with -Wall will warn "warning: control reaches end of non-void function [-Wreturn-type]" unless you place an unreachable statement.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
546

Space after if

llvm/test/MC/X86/x86-64-align-branch-1a.s
1 ↗(On Diff #232116)

1a~1g use the same source file. Move the source to Inputs/align-branch-1-64.s

According to the local naming convention, this test should probably be renamed to align-branch-1-64.s

7 ↗(On Diff #232116)

Delete

11 ↗(On Diff #232116)

Delete Disassembly of section .text:. Ditto below.

llvm/test/MC/X86/x86-64-align-branch-1b.s
4 ↗(On Diff #232116)

Delete

10 ↗(On Diff #232116)

I think 1a.s and 1b.s should be merged. FileCheck supports

--check-prefixes=CHECK,PREFIX5
--check-prefixes=CHECK,PREFIX1

CHECK: common part
CHECK-NEXT: common part
PREFIX5:
PREFIX5-NEXT:
PREFIX1:
PREFIX1-NEXT:
CHECK:
CHECK-NEXT:
% diff -U1 x86-64-align-branch-1[ab].s
 # CHECK: 0000000000000000 foo:
-# CHECK-NEXT:        0: 64 64 64 64 89 04 25 01 00 00 00 movl    %eax, %fs:1
-# CHECK-NEXT:        b: 55                               pushq   %rbp
-# CHECK-NEXT:        c: 55                               pushq   %rbp
-# CHECK-NEXT:        d: 55                               pushq   %rbp
+# CHECK-NEXT:        0: 64 89 04 25 01 00 00 00          movl    %eax, %fs:1
+# CHECK-NEXT:        8: 2e 55                            pushq   %rbp
+# CHECK-NEXT:        a: 2e 55                            pushq   %rbp
+# CHECK-NEXT:        c: 2e 55                            pushq   %rbp
 # CHECK-NEXT:        e: 48 89 e5                         movq    %rsp, %rbp

Is there a performance benefit to adding 4 prefixes to the same instruction?

llvm/test/MC/X86/x86-64-align-branch-1c.s
2 ↗(On Diff #232116)

The difference between 1a and 1c is that 1c does not allow "jmp", but in 1a no jmp instructions get a prefix in the test, so it is unclear why 1c has different output.

llvm/test/MC/X86/x86-64-align-branch-1d.s
4 ↗(On Diff #232116)

Delete

llvm/test/MC/X86/x86-64-align-branch-1e.s
46 ↗(On Diff #232116)

This is weird. Comparing this with 1d, 1e allows more instruction types, yet it inserts two NOPs which actually seems to degrade performance.

llvm/test/MC/X86/x86-64-align-branch-1f.s
8 ↗(On Diff #232116)

No disassembly is needed. Just check that --x86-align-branch-boundary=0 and the default (no x86- specific options) have the identical output (cmp %t %t2)

llvm/test/MC/X86/x86-64-align-branch-1g.s
1 ↗(On Diff #232116)

Merge 1e and 1g. State that -mcpu=x86-64 generates 66 90 instead of 90 90 (but why?)

3 ↗(On Diff #232116)

Delete

7 ↗(On Diff #232116)

Delete Disassembly of section .text:

I find another deficiency (infinite loop) with the current approach.

Say, there is a je 0 (0x0F 0x84 0x00 0x00 0x00 0x00) at byte 0x90. (0x90+6)%32 == 0, so it ends on a 32-byte boundary.
MF.getMaxPrefixSize() is 4, so the size of MCMachineDependentFragment may vary from 0 to 4.
If there are other MCMachineDependentFragment's in the section, some may shrink while some may expand.
In some cases the following loop will not converge

bool MCAssembler::layoutOnce(MCAsmLayout &Layout) {
  ++stats::RelaxationSteps;

  bool WasRelaxed = false;
  for (iterator it = begin(), ie = end(); it != ie; ++it) {
    MCSection &Sec = *it;
    while (layoutSectionOnce(Layout, Sec)) ///
      WasRelaxed = true;
  }

  return WasRelaxed;
}

// In MCAssembler::layoutSectionOnce,
  case MCFragment::FT_MachineDependent:
    RelaxedFrag =
        relaxMachineDependent(Layout, *cast<MCMachineDependentFragment>(I));
    break;

To give a concrete example, clang++ -fsanitize=memory compiler-rt/test/msan/cxa_atexit.cpp -mbranches-within-32B-boundaries does not converge. You may also try dtor-*.cpp in that directory.

A simple iterative algorithm is not guaranteed to converge. We probably can solve the layout problem with a dynamic programming algorithm:

f[i][j] = the minimum cost that layouts the first i instructions with j extra bytes (via NOPs or prefixes)
or
g[i][j] = the minimum inserted bytes that layouts the first i instructions with cost j

I am not clear which one is better. A simple greedy approach is to set an upper limit on the number of iterations when the section contains at least one MCMachineDependentFragment, i.e.

bool HasMCMachineDependentFragment = false;
int count = 5; // arbitrary. Please find an appropriate value.
while (layoutSectionOnce(Layout, Sec, HasMCMachineDependentFragment) && count > 0) {
  WasRelaxed = true;
  if (HasMCMachineDependentFragment)
    count--;
}
MaskRay added inline comments.Wed, Dec 4, 5:59 PM
llvm/lib/MC/MCAssembler.cpp
986

Division is slow. Pass in the power of 2 and use right shift instead.

You may change MCMachineDependentFragment::AlignBoundarySize (a power of 2) to a power.

994

Ditto.

skan updated this revision to Diff 232326.Thu, Dec 5, 5:52 AM
skan marked 5 inline comments as done.Thu, Dec 5, 5:56 AM
skan added a comment.Thu, Dec 5, 7:41 AM

I find another deficiency (infinite loop) with the current approach.

Say, there is a je 0 (0x0F 0x84 0x00 0x00 0x00 0x00) at byte 0x90. (0x90+6)%32 == 0, so it ends on a 32-byte boundary.
MF.getMaxPrefixSize() is 4, so the size of MCMachineDependentFragment may vary from 0 to 4.
If there are other MCMachineDependentFragment's in the section, some may shrink while some may expand.
In some cases the following loop will not converge

bool MCAssembler::layoutOnce(MCAsmLayout &Layout) {
  ++stats::RelaxationSteps;

  bool WasRelaxed = false;
  for (iterator it = begin(), ie = end(); it != ie; ++it) {
    MCSection &Sec = *it;
    while (layoutSectionOnce(Layout, Sec)) ///
      WasRelaxed = true;
  }

  return WasRelaxed;
}
 
// In MCAssembler::layoutSectionOnce,
  case MCFragment::FT_MachineDependent:
    RelaxedFrag =
        relaxMachineDependent(Layout, *cast<MCMachineDependentFragment>(I));
    break;

To give a concrete example, clang++ -fsanitize=memory compiler-rt/test/msan/cxa_atexit.cpp -mbranches-within-32B-boundaries does not converge. You may also try dtor-*.cpp in that directory.

A simple iterative algorithm is not guaranteed to converge. We probably can solve the layout problem with a dynamic programming algorithm:

f[i][j] = the minimum cost that layouts the first i instructions with j extra bytes (via NOPs or prefixes)
or
g[i][j] = the minimum inserted bytes that layouts the first i instructions with cost j

I am not clear which one is better. A simple greedy approach is to set an upper limit on the number of iterations when the section contains at least one MCMachineDependentFragment, i.e.

bool HasMCMachineDependentFragment = false;
int count = 5; // arbitrary. Please find an appropriate value.
while (layoutSectionOnce(Layout, Sec, HasMCMachineDependentFragment) && count > 0) {
  WasRelaxed = true;
  if (HasMCMachineDependentFragment)
    count--;
}

I guess you originally meant (90 + 6) % 32 == 0 instead of (0x90 + 6) % 32 == 0. With the previous patch, the command clang++ -fsanitize=memory compiler-rt/test/msan/cxa_atexit.cpp -mbranches-within-32B-boundaries did not converge, but this is not the fault of the simple iterative algorithm.

Let me briefly describe the algorithm. Regardless of whether prefixes or NOPs are filled in, an MCMachineDependentFragment occupies a certain amount of space to align a branch, and it holds a pointer to the branch it is responsible for. Multiple MCMachineDependentFragments can work together to align one branch.

For example, suppose two MCMachineDependentFragments, M1 and M2, align branch J1. In the first iteration, J1 needs 7 bytes of padding, so M1 grows toward 7 as much as possible; however, for some reason M1 can only grow to 5 or smaller, so M2 then grows as much as possible, to 2. In the second iteration, when it is M1's turn to relax, we subtract the sizes of M1 and M2 from the current address of J1 to get the amount of padding J1 needs; assuming it is 6, M1 keeps its size of 5 and M2 shrinks to 1. It is not difficult to see that, among the MCFragments from M1 to J1, as long as the sizes of the fragments other than M1 and M2 become fixed within a limited number of iterations, the sizes of M1 and M2 will also become fixed within a limited number of iterations; that is, the iteration converges.

As far as I know, the MCFragment kinds used to store instructions are MCDataFragment, MCRelaxableFragment, and MCCompactEncodedInstFragment. MCDataFragment and MCCompactEncodedInstFragment have fixed sizes. MCRelaxableFragment stores instructions of variable size and only grows, never shrinks; as a result, its size becomes fixed within a limited number of iterations. So as long as we ensure that only instruction-storing MCFragments appear between the instruction to be prefixed and the branch to be aligned, this iterative algorithm will converge.

The file cxa_atexit.s has more than one text section. The instruction to be prefixed and the branch to be aligned may be in two different text sections. I forgot to check whether there is an MCFragment that does not store instructions between them, which results in non-convergence. This bug has been fixed in the latest patch.

skan marked 3 inline comments as done.Thu, Dec 5, 8:01 AM
skan added inline comments.
llvm/test/MC/X86/x86-64-align-branch-1b.s
10 ↗(On Diff #232116)

Thanks for the knowledge! There is no performance benefit to adding 4 prefixes to the same instruction. The instruction movl %eax, %fs:1 already has a prefix, %fs (0x64); if the option -x86-align-branch-prefix-size=5 is used, we could add at most 4 more prefixes. The branch to be aligned needs 3 bytes of padding, so 3 prefixes are added to the move instruction.
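The budgeting described here can be sketched as follows (prefixes_to_add is a hypothetical helper; the patch's real logic lives in the assembler backend):

```python
def prefixes_to_add(existing_prefixes, padding_needed, max_prefix_size=5):
    """Number of segment-override prefixes that may be added to one
    instruction: the per-instruction budget set by
    -x86-align-branch-prefix-size, minus the prefixes already present,
    further capped by the padding actually needed."""
    budget = max_prefix_size - existing_prefixes
    return min(budget, padding_needed)

# movl %eax, %fs:1 already carries the 0x64 (%fs) prefix, so with a
# budget of 5 at most 4 more fit; here only 3 bytes of padding are
# needed, so 3 prefixes are added.
assert prefixes_to_add(1, 3) == 3
assert prefixes_to_add(1, 6) == 4
```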

llvm/test/MC/X86/x86-64-align-branch-1c.s
2 ↗(On Diff #232116)
# CHECK-NEXT:       45: 2e 89 45 fc                      movl    %eax, %cs:-4(%rbp)
# CHECK-NEXT:       49: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       4c: 89 7d f8                         movl    %edi, -8(%rbp)
# CHECK-NEXT:       4f: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       52: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       55: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       58: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       5b: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       5e: 5d                               popq    %rbp
# CHECK-NEXT:       5f: 5d                               popq    %rbp
# CHECK-NEXT:       60: eb 26                            jmp     {{.*}}

The prefix 2e is added at

45: 2e 89 45 fc                      movl    %eax, %cs:-4(%rbp)
llvm/test/MC/X86/x86-64-align-branch-1e.s
46 ↗(On Diff #232116)

Not all target CPUs support long NOPs; you can find the related code in X86AsmBackend::writeNopData.
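For context, the recommended multi-byte NOP encodings come from the Intel Software Developer's Manual; targets without long-NOP support have to fall back to repeated single-byte 0x90. The sketch below illustrates the idea behind writeNopData and is not the actual LLVM implementation:

```python
# Recommended multi-byte NOP encodings (Intel SDM); CPUs lacking
# long-NOP support must use repeated single-byte 0x90 instead.
LONG_NOPS = {
    1: b"\x90",
    2: b"\x66\x90",
    3: b"\x0f\x1f\x00",
    4: b"\x0f\x1f\x40\x00",
    5: b"\x0f\x1f\x44\x00\x00",
    6: b"\x66\x0f\x1f\x44\x00\x00",
    7: b"\x0f\x1f\x80\x00\x00\x00\x00",
    8: b"\x0f\x1f\x84\x00\x00\x00\x00\x00",
    9: b"\x66\x0f\x1f\x84\x00\x00\x00\x00\x00",
}

def nop_pad(n, has_long_nops=True):
    """Emit n bytes of NOP padding, greedily using the longest
    recommended encoding when the target supports it."""
    if not has_long_nops:
        return b"\x90" * n
    out = b""
    while n:
        k = min(n, max(LONG_NOPS))
        out += LONG_NOPS[k]
        n -= k
    return out

assert len(nop_pad(13)) == 13                       # one 9-byte + one 4-byte NOP
assert nop_pad(3, has_long_nops=False) == b"\x90\x90\x90"
```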

FYI: I did close the bug as fixed after verifying that the fix works for me.

reames added a comment.Thu, Dec 5, 2:38 PM

We uncovered another functional issue with this patch, or at least, the interaction of this patch and other parts of LLVM. In our support for STATEPOINT, PATCHPOINT, and STACKMAP we use N-byte nop sequences for regions of code which might be patched out. It's important that these regions are exactly N bytes as concurrent patching which doesn't replace an integral number of instructions is ill-defined on X86-64. This patch causes the N-byte nop sequence to sometimes become (N+M) bytes which breaks the patching. I believe that the XRAY support may have a similar issue.

More generally, I'm worried about the legality of arbitrarily prefixing instructions from unknown sources. In the particular example we saw, we had something along the following:

.Ltmp0:
.p2align 3, 0x90
(16 byte nop sequence)
.Ltmp3:
jmp *%rax

In addition to the patching legality issue above, padding the nop sequence does something else interesting in this example. It changes the alignment of Ltmp3. Before, Ltmp3 was always 8 byte aligned, after prefixes are added, it's not. It's not clear to me exactly what the required semantics here are, but we at least had been assuming the alignment of Ltmp3 was guaranteed in this case. (That's actually how we found the patching issue.)

reames added a comment.Thu, Dec 5, 3:14 PM

I've been digging through the code for this for the last day or so. This is a new area for me, so it's possible I'm off base, but I have some concerns about the current design.

First, there appears to already be support for instruction bundling and alignment in the assembler today. I stumbled across the .bundle_align_mode, .bundle_start, and .bundle_end mechanism (https://lists.llvm.org/pipermail/llvm-dev/2012-December/056723.html) which seems to *heavily* overlap with this proposal. I suspect that the compiler support suggested by James and myself earlier in this thread could be implemented on top of this existing mechanism.

Second, the new callbacks and infrastructure added appear to overlap heavily with the MCCodePadding infrastructure. (Which, admittedly, appears unused and untested.)

Third, I have not seen a justification for why the complexity of instruction prefix padding is necessary. All the affected CPUs support multi-byte NOPs, so we're talking about a *single micro-op* difference between the NOP form and the prefix form. Can anyone point to a performance delta due to this? If not, I'd suggest we should start with the NOP form, and then build the prefix form in a generic manner for all alignment varieties.


My conclusion after looking at all of that was actually that I plan to propose removing both the MCCodePadding and all the bundle-padding infrastructure, not add new stuff on top of it -- the former is unused, and I believe the latter is only for Chrome's NaCl, which is deprecated, and fairly close to being removed. If we need something similar in the future, we should certainly look to both of those for inspiration, but I don't think we need to be constrained by them.

Third, I have not seen a justification for why the complexity of instruction prefix padding is necessary. All the affected CPUs support multi-byte NOPs, so we're talking about a *single micro-op* difference between the NOP form and the prefix form. Can anyone point to a performance delta due to this? If not, I'd suggest we should start with the NOP form, and then build the prefix form in a generic manner for all alignment varieties.

+1.

MaskRay added a subscriber: opaparo.Thu, Dec 5, 3:41 PM


CC the author of D34393 - @opaparo for MCCodePadding. Intel folks may know how to contact @opaparo?

I also noticed that MCCodePadder.cpp was never updated (except for a license change) after the initial check-in.


+1. Starting from just NOP padding sounds like a simple and good first step. We can explore segment override prefixes in the future.

reames added a comment.Thu, Dec 5, 5:43 PM


I can definitely see removing the code padding stuff, since it's unused and untested.

As for the bundle mechanisms, why? It seems like exactly what we're going to want here. Regardless of the auto-detect feature, we're going to need a representation of a bundle which needs to be properly placed to avoid splitting, and the current code does that. Why not reuse the presumably reasonably well-tested existing infrastructure? The only extra thing we seem to need is the ability to toggle off bundle formation for instruction types we don't care about. Since we're going to need an assembler spelling for that regardless, the current code seems like a decent baseline.


I created D71106 to delete MCCodePadder and accompanying classes.


I think it's a good suggestion to start with NOP padding as the first step. In our previous experiment, we saw that prefix padding was slightly better than NOP padding, but not by much. We will retest NOP padding and get back to you.


For whatever it may be worth: Agner Fog's empirical research on x86 pipelines and his review of manufacturer optimization guidelines also conclude that prefixes are often preferable to NOPs on modern x86 processors. (See: https://www.agner.org/optimize/microarchitecture.pdf) This arguably isn't surprising: the decoder has to be good at finding instruction boundaries, prefixes and all, while NOPs of any size still consume decode slots and therefore dilute decode bandwidth.

Recording something so I don't forget it when we get back to the prefix padding version. The write up on the bundle align mode stuff mentions a concerning memory overhead for the feature. Since the basic implementation techniques are similar, we need to make sure we assess the memory overhead of the prefix padding implementation. See https://www.chromium.org/nativeclient/pnacl/aligned-bundling-support-in-llvm for context. I don't believe this is likely to be an issue for the nop padding variant.


From the doc at https://www.chromium.org/nativeclient/pnacl/aligned-bundling-support-in-llvm, the ".bundle_align_mode" directive ensures that each instruction following it does not cross the alignment boundary, and that the first instruction following the directive is aligned; but in this patch we only require that branch (jcc, jmp, ...) instructions do not cross the alignment boundary. Another reminder: this patch also avoids having a branch instruction end exactly on the alignment boundary, and the bundle syntax doesn't support that (see section 2.1 of https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf). So I don't think the bundle syntax fits the current requirement perfectly.

Branch alignment


The primary goal of this patch, restricting the placement of branch instructions, is a performance optimization. Similar to loop alignment, the desire is to increase speed at the cost of code size. However, the way this feature has been designed is as a global assembler flag. I find that not ideal, because it cannot take into account the hotness of a block/function, as for example the loop alignment code does. Basic-block alignment of loops is explicitly opt-in on a block-by-block basis -- the compiler simply emits a p2align directive where it needs one, and the assembler honors that. And so, MachineBlockPlacement::alignBlocks has a bunch of conditions under which it will avoid emitting a p2align. This seems like a good model -- the assembler does what it's told by the compiler (or assembly-writer). Making the branch-instruction alignment work similarly seems like it would be good.

IMO it would be nicest if there could be a directive that requests to specially-align the next instruction. However, the fused-jcc case makes that quite tricky, so perhaps this ought to also be a mode which can be enabled/disabled on a region as well.

Yes, the primary goal of this patch is a performance optimization or mitigation. The intention is to provide a simple method for users to mitigate the performance impact of JCC MCU with less effort. We also provide users several options to tune the performance. But the basic idea is to make it easy for users to mitigate it and improve the performance.

Your proposal is a good idea. But I'm afraid it may not cover all the scenarios. Firstly, the proposal relies on the compiler to detect the hotspots. But the compiler needs LTO and/or PGO to get precise hot spots. Otherwise, if the compiler misses hot spots that impact the application performance, the users have no choice but to insert the directives manually. Secondly, for existing code written in assembly, the compiler can't handle it. The users have to insert the directives by hand, which is quite a lot of work.

I think the current patch wants to give a simple and general solution to mitigate the JCC MCU performance impact for both C/C++ and assembly. And it doesn't need any source code change. I think your proposal would be a good enhancement on top of it.

skan added a comment.Sun, Dec 8, 11:43 PM

We uncovered another functional issue with this patch, or at least, the interaction of this patch and other parts of LLVM. In our support for STATEPOINT, PATCHPOINT, and STACKMAP we use N-byte nop sequences for regions of code which might be patched out. It's important that these regions are exactly N bytes as concurrent patching which doesn't replace an integral number of instructions is ill-defined on X86-64. This patch causes the N-byte nop sequence to sometimes become (N+M) bytes which breaks the patching. I believe that the XRAY support may have a similar issue.

More generally, I'm worried about the legality of arbitrarily prefixing instructions from unknown sources. In the particular example we saw, we had something along the following:

.Ltmp0:

	.p2align	3, 0x90
	(16 byte nop sequence)

.Ltmp3:

	jmp *%rax

In addition to the patching legality issue above, padding the nop sequence does something else interesting in this example. It changes the alignment of Ltmp3. Before, Ltmp3 was always 8 byte aligned, after prefixes are added, it's not. It's not clear to me exactly what the required semantics here are, but we at least had been assuming the alignment of Ltmp3 was guaranteed in this case. (That's actually how we found the patching issue.)

I could not reproduce the phenomenon that an N-byte nop becomes (N+M) bytes with your example. So, according to my understanding, I slightly modified your case. (If my understanding is wrong, I hope you can point it out :-).)

    .text
    nop
.Ltmp0:
    .p2align 3, 0x90
    .rept 16
    nop
    .endr
.Ltmp3:
    movl  %eax, -4(%rsp)
    .rept 2
    nop
    .endr
    jmp .Ltmp0

The instruction jmp .Ltmp0 starts at byte 0x1e and ends at byte 0x20. If we align the jump with a prefix, two prefixes will be added to the .rept 16 / nop / .endr sequence. After the prefixes are added, the 16-byte nop becomes an 18-byte nop, and then the label '.Ltmp3' is no longer 8-byte aligned.

I doubt whether the assumption that '.Ltmp3' is 8-byte aligned is right, since the alignment is not explicitly required. For example, I think we cannot assume the instruction jmp .Ltmp0 always starts at byte 0x1e. And in fact, as long as the binary generated by the compiler does not have a one-to-one correspondence with the assembly code, at least one implicit assumption about an instruction will be broken.
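For reference, the condition the mitigation pads against (per the Intel JCC erratum document: a jump that crosses a 32-byte boundary or whose last byte ends on one) can be expressed with a small hypothetical helper:

```python
def needs_alignment(addr, size, boundary=32):
    """True if an instruction at `addr` of `size` bytes crosses a
    32-byte boundary or ends exactly on one -- the condition the JCC
    erratum mitigation pads against."""
    last = addr + size - 1
    crosses = addr // boundary != last // boundary
    ends_on = (addr + size) % boundary == 0
    return crosses or ends_on

# The `jmp .Ltmp0` above: 2 bytes starting at 0x1e ends exactly on the
# 32-byte boundary at 0x20, so it gets padded.
assert needs_alignment(0x1e, 2)
# After 2 bytes of padding it starts at 0x20, inside one window.
assert not needs_alignment(0x20, 2)
```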

I could not reproduce the phenomenon that an N-byte nop becomes (N+M) bytes with your example. So, according to my understanding, I slightly modified your case. (If my understanding is wrong, I hope you can point it out :-).)

    .text
    nop
.Ltmp0:
    .p2align 3, 0x90
    .rept 16
    nop
    .endr
.Ltmp3:
    movl  %eax, -4(%rsp)

In our case it was

andl $1, %eax

but it does not matter that much.

.rept 2
nop
.endr
jmp .Ltmp0
The instruction `jmp .Ltmp0` starts at byte 0x1e and ends at byte 0x20.

Again, in our particular case start of the sequence was at xxx8, so 8 + 16(our sequence) + 3(andl) + 5(jmp) == 32.

If we align the jump with a prefix, two prefixes will be added to the .rept 16 / nop / .endr sequence. After the prefixes are added, the 16-byte nop becomes an 18-byte nop, and then the label '.Ltmp3' is no longer 8-byte aligned.

Yes, that's what happened.

I doubt whether the assumption that '.Ltmp3' is 8-byte aligned is right, since the alignment is not explicitly required.

The point is that we have an explicit requirement at the start, and we have a lowering into a 16-byte sequence that needs to be preserved exactly as it is.
Essentially, what we need is "protection" for this sequence from any changes by the machinery that generates the binary code.
How can we protect a particular byte sequence from being changed by this branch aligner?

annita.zhang added a comment.EditedMon, Dec 9, 5:14 AM

The point is that we have an explicit requirement at the start, and we have a lowering into a 16-byte sequence that needs to be preserved exactly as it is.
Essentially, what we need is "protection" for this sequence from any changes by the machinery that generates the binary code.
How can we protect a particular byte sequence from being changed by this branch aligner?

No, in general we can't. The current solution relies on the assembler inserting a prefix or NOP before branches that cross (or end on) a boundary. It can only ensure the explicit alignment specified by a directive, not any implicit alignment. I don't think any assembler-based fixup can do that. On the other hand, any code sequence after an alignment directive, or even just inside a function, has some kind of implicit alignment. It's hard for the assembler to tell which implicit alignment to preserve. The preferred way is to use an explicit alignment directive to specify it.

For your scenario, NOP padding is more controllable. NOP padding will be inserted just before the branch instructions (or macro-fused branch instructions). So if there are no branches (or macro-fused branches) in your code sequence, no NOP will be inserted.


What if I insert explicit align(8) right *after* the sequence?

reames added a comment.Mon, Dec 9, 5:17 PM

I just posted an alternate review (https://reviews.llvm.org/D71238) which attempts to carve out a minimum reviewable piece of complexity. The hope is that we can review that one quickly (as there are fewer interacting concerns), and then rebase this one (possibly splitting further).

I had previously suggested in review comments that we should reuse the infrastructure from .bundle_align_mode. When I sat down to actually implement that, I discovered that the code for that has a bunch of interacting assumptions about when fragments are constructed and used vs alignment boundaries. I got a version of this working, but the complexity was worrisome. I now suggest that we should take the rough approach sketched here (a separate fragment before the one being aligned), delete the essentially unused bundle mode code, and revisit a unified representation if needed for memory density at a later time. (i.e. my previous suggestion wasn't a good one)

skan added a comment.Mon, Dec 9, 5:27 PM


What if I insert explicit align(8) right *after* the sequence?

If you insert an explicit .align 8 after the sequence, and the sequence doesn't have any branch to be aligned, the current solution won't change the sequence.

skan updated this revision to Diff 233008.Mon, Dec 9, 11:48 PM


Well, I kind of figured that from the code's current behavior, but is it really by design, or does it just happen to work now?
Seeing that the assembler is becoming very intelligent, I would rather have a strict guarantee, similar to the "hardcode" thing you have here, that protects my sequence,
rather than relying on the fact that, under the current settings, moving my label by 8 does not appear profitable to the assembler.

skan added a comment.Wed, Dec 11, 10:40 PM


Well, I kind of figured that from the code's current behavior, but is it really by design, or does it just happen to work now?

It is by design: a prefix won't be added to an instruction if there is an align directive in the path to the target branch.

Seeing that the assembler is becoming very intelligent, I would rather have a strict guarantee, similar to the "hardcode" thing you have here, that protects my sequence,
rather than relying on the fact that, under the current settings, moving my label by 8 does not appear profitable to the assembler.

It depends on what your real need is. If you just need your original implicit alignment to work, turning it into an explicit alignment is enough. Or, if you need a sequence that is not changed by
the assembler at all, maybe we need a new directive such as ".hard" or ".bundle".

skan updated this revision to Diff 233545.Thu, Dec 12, 2:45 AM
  1. Replace divide and mod operations with bitwise operations
  2. Refine help information
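The first item relies on the standard power-of-two identities x / 32 == x >> 5 and x % 32 == x & 31. A hypothetical before/after sketch of the boundary-cross check (the boundary must be a power of two for the bitwise form to be valid):

```python
BOUNDARY = 32  # must be a power of two for the bitwise form to hold

def crosses_boundary_div(addr, size):
    # Straightforward form using division.
    return addr // BOUNDARY != (addr + size - 1) // BOUNDARY

def crosses_boundary_bitwise(addr, size):
    # Masking off the low bits gives the aligned-down address, so
    # comparing masked addresses is equivalent to comparing quotients.
    mask = ~(BOUNDARY - 1)
    return (addr & mask) != ((addr + size - 1) & mask)

# The two forms agree on every (address, size) pair.
for addr in range(64):
    for size in range(1, 16):
        assert crosses_boundary_div(addr, size) == crosses_boundary_bitwise(addr, size)
```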