This is an archive of the discontinued LLVM Phabricator instance.

Align branches within 32-Byte boundary(NOP padding)
ClosedPublic

Authored by skan on Nov 12 2019, 6:50 PM.

Details

Summary

Microcode update for Jump Conditional Code Erratum may cause performance
loss for some workloads:

https://www.intel.com/content/www/us/en/support/articles/000055650.html

Here is the patch to mitigate performance impact by aligning branches
within 32-byte boundary. The impacted instructions are:

a. Conditional jump.
b. Fused conditional jump.
c. Unconditional jump.
d. Indirect jump.
e. Ret.
f. Call.

Add two options for llvm-mc:

  1. -x86-align-branch-boundary=NUM aligns branches within NUM byte boundary.
  2. -x86-align-branch=TYPE[+TYPE...] specifies types of branches to align.

to align branches within a 32-Byte boundary to reduce the potential performance
loss of the microcode update.

A new MCFragment type, MCBoundaryAlignFragment, is added, which may emit
NOP to align the fused/unfused branch.

alignBranchesBegin inserts MCBoundaryAlignFragment before instructions,
alignBranchesEnd marks the end of the branch to be aligned,
relaxBoundaryAlign grows or shrinks sizes of NOP to align the target branch.

Nop padding is disabled when the instruction may be rewritten by the linker,
such as TLS Call.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
jyknight added inline comments.Nov 15 2019, 1:30 PM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
427

This set of functions down to isIndirectBranch() seems unnecessary. Pushing one line

const MCInstrDesc &InstDesc = MCII.get(Inst.getOpcode());

into needAlign(const MCInst &Inst), and then just using InstDesc.isReturn() etc. would be fine.

472

please run something like "git clang-format HEAD~1" to re-format your patch.

476

Comment on why this is doing what it's doing?

782–783

Why?

llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h
106 ↗(On Diff #229511)

"MF" is not a readable abbreviation here -- spell out "MacroFusion" instead. It was OK locally in the "X86MacroFusion.cpp" file, as the filename gave you a hint, but not here, where you have no such hint.

134 ↗(On Diff #229511)

This function is also named poorly, after being moved to this more-generic location. "Classify" doesn't tell me anything about this being related to macro-fusion.

skan updated this revision to Diff 229689.Nov 16 2019, 6:04 AM
skan marked 11 inline comments as done.Nov 17 2019, 11:41 PM
skan updated this revision to Diff 230241.Nov 20 2019, 5:43 AM
skan updated this revision to Diff 230378.Nov 20 2019, 11:35 PM
skan edited the summary of this revision. (Show Details)

Not insert nop or prefix if the previous item is hard code, which may be
used to hardcode an instruction, since there is no clear instruction boundary.

Thanks for the comments, they help a little. But it's still somewhat confusing, so let me write down what seems to be happening:

  • Before emitting every instruction, a new MCMachineDependentFragment is now emitted, of one of the multiple types:
    • For most instructions, that'll be BranchPrefix.
    • For things that need branch-alignment, it'll be BranchPadding, unless there's a fused conditional before, in which case it's BranchSplit
    • For fused conditionals, it'll be FusedJccPadding.
  • After emitting an instruction that needs branch-alignment, all of those previously-emitted MCMachineDependentFragment are updated to point to the branch's fragment.
  • Thus, every MCDataFragment now only contains a single instruction (this property is depended upon for getInstSize, at least).

All the MCMachineDependentFragments in a region bounded by a branch at the end and either a branch or a fragment-type which is not type in {FT_Data, FT_MachineDependent, FT_Relaxable, FT_CompactEncodedInst} at the beginning, will reference the ending branch instruction's fragment.

Then, when it comes time to do relaxation, every one of those machine-dependent-fragments has the opportunity to grow its instruction a little bit. The first instruction in a "block" will grow up to 5 segment prefixes (via modifying the BranchPrefix fragment), and then if more is needed, more prefixes will be added to the next instruction, and so on. Until you run out of instructions in the region. At which point the BranchPadding or FusedJccPadding types (right before the branch/fused-branch) will be able to emit nops to achieve the desired alignment.

An alternative would be to simply emit NOPs before branches as needed. That would be substantially simpler, since it would only require special handling for a branch or a fused-branch. I assume things were done this substantially-more-complex way in order to reduce performance cost of inserting NOP instructions? Are there numbers for how much better it is to use segment prefixes, vs a separate nop instruction? It seems a little bit surprising to me that it would be that important, but I don't know...

I'll note that the method here has the semantic issue of making it effectively impossible to ever evaluate an expression like ".if . - symbol == 24" (assuming we're emitting instructions), since every instruction can now change size. I suspect that will make it impossible to turn this on by default without breaking a lot of assembly code. Previously, only certain instructions, like branches or arithmetic ops with constant arguments of unknown value, could change size.

llvm/lib/MC/MCAssembler.cpp
1030

I don't think this is necessary. AFAICT, the symbols should already be in the right place -- pointing to the relax fragment, not the instruction itself, without this. And removing all this moveSymbol/updateSymbolMap code doesn't make any tests fail.

Thanks for the comments, they help a little. But it's still somewhat confusing, so let me write down what seems to be happening:

  • Before emitting every instruction, a new MCMachineDependentFragment is now emitted, of one of the multiple types:
    • For most instructions, that'll be BranchPrefix.
    • For things that need branch-alignment, it'll be BranchPadding, unless there's a fused conditional before, in which case it's BranchSplit
    • For fused conditionals, it'll be FusedJccPadding.
  • After emitting an instruction that needs branch-alignment, all of those previously-emitted MCMachineDependentFragment are updated to point to the branch's fragment.
  • Thus, every MCDataFragment now only contains a single instruction (this property is depended upon for getInstSize, at least).

All the MCMachineDependentFragments in a region bounded by a branch at the end and either a branch or a fragment-type which is not type in {FT_Data, FT_MachineDependent, FT_Relaxable, FT_CompactEncodedInst} at the beginning, will reference the ending branch instruction's fragment.

Then, when it comes time to do relaxation, every one of those machine-dependent-fragments has the opportunity to grow its instruction a little bit. The first instruction in a "block" will grow up to 5 segment prefixes (via modifying the BranchPrefix fragment), and then if more is needed, more prefixes will be added to the next instruction, and so on. Until you run out of instructions in the region. At which point the BranchPadding or FusedJccPadding types (right before the branch/fused-branch) will be able to emit nops to achieve the desired alignment.

An alternative would be to simply emit NOPs before branches as needed. That would be substantially simpler, since it would only require special handling for a branch or a fused-branch. I assume things were done this substantially-more-complex way in order to reduce performance cost of inserting NOP instructions? Are there numbers for how much better it is to use segment prefixes, vs a separate nop instruction? It seems a little bit surprising to me that it would be that important, but I don't know...

I don't have any numbers myself. I was only involved in some of the code review internally. My understanding is that NOP instructions would place extra nop uops into the DSB(the decoded uop buffer) and that limits the performance that can be recovered. By using redundant prefixes no extra uops are generated and more performance is recovered.

I'll note that the method here has the semantic issue of making it effectively impossible to ever evaluate an expression like ".if . - symbol == 24" (assuming we're emitting instructions), since every instruction can now change size. I suspect that will make it impossible to turn this on by default without breaking a lot of assembly code. Previously, only certain instructions, like branches or arithmetic ops with constant arguments of unknown value, could change size.

efriedma added inline comments.
llvm/include/llvm/MC/MCFragment.h
591

Global variables are forbidden in LLVM libraries; there could be multiple LLVMContexts in the same process.

skan marked an inline comment as done.Nov 21 2019, 8:26 PM

Thanks for the comments, they help a little. But it's still somewhat confusing, so let me write down what seems to be happening:

  • Before emitting every instruction, a new MCMachineDependentFragment is now emitted, of one of the multiple types:
    • For most instructions, that'll be BranchPrefix.
    • For things that need branch-alignment, it'll be BranchPadding, unless there's a fused conditional before, in which case it's BranchSplit
    • For fused conditionals, it'll be FusedJccPadding.
  • After emitting an instruction that needs branch-alignment, all of those previously-emitted MCMachineDependentFragment are updated to point to the branch's fragment.
  • Thus, every MCDataFragment now only contains a single instruction (this property is depended upon for getInstSize, at least).

All the MCMachineDependentFragments in a region bounded by a branch at the end and either a branch or a fragment-type which is not type in {FT_Data, FT_MachineDependent, FT_Relaxable, FT_CompactEncodedInst} at the beginning, will reference the ending branch instruction's fragment.

Then, when it comes time to do relaxation, every one of those machine-dependent-fragments has the opportunity to grow its instruction a little bit. The first instruction in a "block" will grow up to 5 segment prefixes (via modifying the BranchPrefix fragment), and then if more is needed, more prefixes will be added to the next instruction, and so on. Until you run out of instructions in the region. At which point the BranchPadding or FusedJccPadding types (right before the branch/fused-branch) will be able to emit nops to achieve the desired alignment.

An alternative would be to simply emit NOPs before branches as needed. That would be substantially simpler, since it would only require special handling for a branch or a fused-branch. I assume things were done this substantially-more-complex way in order to reduce performance cost of inserting NOP instructions? Are there numbers for how much better it is to use segment prefixes, vs a separate nop instruction? It seems a little bit surprising to me that it would be that important, but I don't know...

I'll note that the method here has the semantic issue of making it effectively impossible to ever evaluate an expression like ".if . - symbol == 24" (assuming we're emitting instructions), since every instruction can now change size. I suspect that will make it impossible to turn this on by default without breaking a lot of assembly code. Previously, only certain instructions, like branches or arithmetic ops with constant arguments of unknown value, could change size.

Thanks for your detailed and accurate analysis for my code! I am sorry that this should have been done by me.

llvm/lib/MC/MCAssembler.cpp
1030

Yes, I check it and you are right. I will removing all this moveSymbol/updateSymbolMap code.

skan updated this revision to Diff 230605.Nov 22 2019, 1:12 AM
skan edited the summary of this revision. (Show Details)

Three changes are made:

  1. Remove moveSymbol/updateSymbolMap code since it is not necessary
  2. Make variable AlignBoundarySize and variable AlignMaxPrefixSize not global
  3. Disable nop padding before instruction with variant symbol operand since it may be rewritten by linker.
skan updated this revision to Diff 230617.EditedNov 22 2019, 2:25 AM

Speed up if we only want to simply emit NOPs before branches as needed.

skan added a comment.Nov 22 2019, 3:01 AM

An alternative would be to simply emit NOPs before branches as needed. That would be substantially simpler, since it would only require special handling for a branch or a fused-branch. I assume things were done this substantially-more-complex way in order to reduce performance cost of inserting NOP instructions? Are there numbers for how much better it is to use segment prefixes, vs a separate nop instruction? It seems a little bit surprising to me that it would be that important, but I don't know...

I'll note that the method here has the semantic issue of making it effectively impossible to ever evaluate an expression like ".if . - symbol == 24" (assuming we're emitting instructions), since every instruction can now change size. I suspect that will make it impossible to turn this on by default without breaking a lot of assembly code. Previously, only certain instructions, like branches or arithmetic ops with constant arguments of unknown value, could change size.

Thanks for your remind! Now, if -malign-branch-prefix-size=0 is used, the method will only insert BranchPadding or FusedJccPadding before branches as needed, and insert BranchPrefix before the instruction which is macro fusible but not macro fused. In this condition, the operation is as simple as inserting nop only. Since more performance can be recovered by inserting prefixes than inserting nops, I believe I should support prefix padding.

skan edited the summary of this revision. (Show Details)Nov 22 2019, 3:48 AM

(Just a reminder that we need to have both performance and code size numbers for this patch. And given that there are a few options, may need a few examples.)

I have some other concerns about the code itself, but after pondering this a little bit, I'd like to instead bring up some rather more general concerns about the overall approach used here -- with some suggestions. (As these comments are not really comments on the _code_, but on the overall strategy, they also apply to the gnu binutils patch for this same feature.)

Of course, echoing chandler, it would be nice to see some performance numbers, otherwise it's not clear how useful any of this is.

Segment-prefix padding

The ability to pad instructions instead of adding a multibyte-NOP in order to create alignment seems like a generally-useful feature, which should be usable in other situations where we align within a function -- assuming that there is indeed a measurable performance benefit vs NOP instructions. E.g. we should do this for basic-block alignment, as well! As such, it feels like this feature ought to be split out, and implemented separately from the new branch-alignment functionality -- in a way which is usable for any kind of alignment request.

The way I'd imagine it working is to introduce a new pair of asm directives, to enable and disable segment-prefix padding in a certain range of instructions (let's say ".enable_instr_prefix_pad", ".disable_instr_prefix_pad". Within such an enabled range, the assembler would prefix instructions as required in order to handle later nominally-nop-emitting code alignment directives (such as the usual '.p2align 4, 0x90') .

Branch alignment

The primary goal of this patch, restricting the placement of branch instructions, is a performance optimization. Similar to loop alignment, the desire is to increase speed, at the cost of code-size. However, the way this feature has been designed is a global assembler flag. I find that not ideal, because it cannot take into account hotness of a block/function, as for example loop alignment code does. Basic-block alignment of loops is explicitly opt-in on an block-by-block basis -- the compiler simply emits a p2align directive where it needs, and the assembler honors that. And so, MachineBlockPlacement::alignBlocks has a bunch of conditions under which it will avoid emitting a p2align. This seems like a good model -- the assembler does what it's told by the compiler (or assembly-writer). Making the branch-instruction-alignment work similarly seems like it would be good.

IMO it would be nicest if there could be a directive that requests to specially-align the next instruction. However, the fused-jcc case makes that quite tricky, so perhaps this ought to also be a mode which can be enabled/disabled on a region as well.

Enabling by default

Previously, I'd mentioned that it seemed likely we couldn't actually enable branch-alignment by default because it'll probably break people's inline-asm and standalone asm files. That would be solved by making everything controllable within the asm file. The compiler could then insert the directives for its own code, and disable it around inline assembly. And standalone asm files could remain unaffected, unless they opt in. With that, we could actually enable the alignment by default, for compiled output in certain cpu-tuning modes, if it's warranted.

reames added a subscriber: reames.Dec 2 2019, 9:02 AM
reames added a comment.Dec 2 2019, 9:34 AM

I want to chime in support of jyknight's meta comments - particularly the one about the need to balance execution speed vs code size differently in hot vs cold code. For our use case, we have a very large amount of branch dense known cold paths, and being able to only align fast path branches would be a substantial space savings.

I also see value in having the prefix padding feature factored out generically. If that mechanism is truly measurably faster than multi-byte nops - which if I reading comments correctly, has been claimed but not documented or measured? - using it generically for other alignment purposes would likely be worthwhile.

I'd also like to see - probably in a separate patch - support for auto-detecting whether the host CPU needs this mitigation. Both -mcpu=native and various JITs will end up needing this, having the code centralized in one place would be good.

Just FYI for now as I'm trying to dig futher...
I have been trying this fix in our downstream environment and managed to get a hang with this backtrace:

#0 needPadding (BoundarySize=<optimized out>, Size=<optimized out>, StartAddr=<optimized out>) at ./llvm/lib/MC/MCAssembler.cpp:1028
#1 llvm::MCAssembler::relaxMachineDependent (this=<optimized out>, Layout=..., MF=...) at ./llvm/lib/MC/MCAssembler.cpp:1077
#2 0x00007ffff1adc8b6 in llvm::MCAssembler::layoutSectionOnce (this=this@entry=0x7fff90580580, Layout=..., Sec=...) at ./orca/llvm/lib/MC/MCAssembler.cpp:1213
...

Working on getting upstream reproducer....

fedor.sergeev requested changes to this revision.EditedDec 3 2019, 2:42 PM

Working on getting upstream reproducer....

Hangs on a moderately small piece (~150 lines) of x86 assembly, filed a bug on it here:

https://bugs.llvm.org/show_bug.cgi?id=44215

This revision now requires changes to proceed.Dec 3 2019, 2:42 PM

I'm seeing lots of updates to fix bugs, but no movement for many days on both my meta comments and (in some ways more importantly) James's meta comments. (And thanks Philip for chiming in too!)

Meanwhile, we really, really need to get this functionality in place. The entire story for minimizing the new microcode performance hit hinges on these patches, and I'm really worried by how little progress we're seeing here.

craig.topper added inline comments.Dec 3 2019, 10:23 PM
llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h
158 ↗(On Diff #232032)

None of the AND/ADD/SUB instructions ending in mr are eligible for macrofusion as far as I know. Those all involve a load and a store which is not supported by macrofusion.

We also lost all the ADD*_DB instructions from the macrofusion list. I believe they are in the existing list incorrectly. So removing them is correct, but as far as I can see that change was not mentioned in the description of this patch.

Can we split the macrofusion refactoring out of this patch so we can review it separately and hopefully get it committed sooner than the other review feedback.

annita.zhang added a comment.EditedDec 3 2019, 11:29 PM

I'm seeing lots of updates to fix bugs, but no movement for many days on both my meta comments and (in some ways more importantly) James's meta comments. (And thanks Philip for chiming in too!)

Meanwhile, we really, really need to get this functionality in place. The entire story for minimizing the new microcode performance hit hinges on these patches, and I'm really worried by how little progress we're seeing here.

Sorry for belated response. We're working hard to go through some paper work to get the performance data ready. I think maybe it's better to open a mailing thread in llvm-dev to post those performance data and discuss those suggestions.

The first data was posted in http://lists.llvm.org/pipermail/llvm-dev/2019-December/137413.html.

Thanks,
Annita

skan updated this revision to Diff 232045.Dec 4 2019, 12:17 AM

Fix the macro fusion table. If the first source/destination operand of an instruction is memory, it is unfusible.

skan marked an inline comment as done.Dec 4 2019, 12:20 AM
skan added inline comments.
llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h
158 ↗(On Diff #232032)

Okay, I will upload another patch to correct the macrofusion table as soon as possible.

Can you please put the macro fusion changes in a separate phabricator review. I’ll review it in the morning US time and if it all looks good we can get that part committed while the other comments are being addressed.

skan added a comment.Dec 4 2019, 12:47 AM

Can you please put the macro fusion changes in a separate phabricator review. I’ll review it in the morning US time and if it all looks good we can get that part committed while the other comments are being addressed.

Sure.

skan updated this revision to Diff 232068.Dec 4 2019, 3:08 AM

We shouldn't try to insert prefix/nop t in a virtual section.

skan updated this revision to Diff 232116.Dec 4 2019, 6:33 AM

Since hardcode only exists in text section, we should not insert HardCodeBegin/HardCodeEnd in other section, such as a virtual section.

spatel added a subscriber: spatel.Dec 4 2019, 6:57 AM

I am still trying to understand the patch. Just made some comments about the tests.

llvm/include/llvm/MC/MCFragment.h
570

Don’t duplicate function or class name at the beginning of the comment (BranchPadding - ). (ref: https://llvm.org/docs/CodingStandards.html#doxygen-use-in-documentation-comments)

581

Full stop.

616

Move llvm_unreachable below the switch, otherwise clang will give a warning:

warning: default label in switch which covers all enumera

tion values [-Wcovered-switch-default]

Unfortunately all GCC (even 9) -Wall will warn warning: control reaches end of non-void function [-Wreturn-type] unless you place an unreachable statement.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
545

Space after if

llvm/test/MC/X86/x86-64-align-branch-1a.s
1 ↗(On Diff #232116)

1a~1g use the same source file. Move the source to Inputs/align-branch-1-64.s

According to the local naming convention, this test should probably be renamed to align-branch-1-64.s

7 ↗(On Diff #232116)

Delete

11 ↗(On Diff #232116)

Delete Disassembly of section .text:. Ditto below.

llvm/test/MC/X86/x86-64-align-branch-1b.s
4 ↗(On Diff #232116)

Delete

10 ↗(On Diff #232116)

I think 1a.s and 1b.s should be merged. FileCheck supports

--check-prefixes=CHECK,PREFIX5
--check-prefixes=CHECK,PREFIX1

CHECK: common part
CHECK-NEXT: common part
PREFIX5:
PREFIX5-NEXT:
PREFIX1:
PREFIX1-NEXT:
CHECK:
CHECK-NEXT:
% diff -U1 x86-64-align-branch-1[ab].s
 # CHECK: 0000000000000000 foo:
-# CHECK-NEXT:        0: 64 64 64 64 89 04 25 01 00 00 00 movl    %eax, %fs:1
-# CHECK-NEXT:        b: 55                               pushq   %rbp
-# CHECK-NEXT:        c: 55                               pushq   %rbp
-# CHECK-NEXT:        d: 55                               pushq   %rbp
+# CHECK-NEXT:        0: 64 89 04 25 01 00 00 00          movl    %eax, %fs:1
+# CHECK-NEXT:        8: 2e 55                            pushq   %rbp
+# CHECK-NEXT:        a: 2e 55                            pushq   %rbp
+# CHECK-NEXT:        c: 2e 55                            pushq   %rbp
 # CHECK-NEXT:        e: 48 89 e5                         movq    %rsp, %rbp

Is there performance benefit to add 4 prefixes to the same instruction?

llvm/test/MC/X86/x86-64-align-branch-1c.s
2 ↗(On Diff #232116)

The difference between 1a and 1c is that 1c does not allow "jmp", but in 1a no jmp instructions get a prefix in the test, so it is unclear why 1c has different output.

llvm/test/MC/X86/x86-64-align-branch-1d.s
4 ↗(On Diff #232116)

Delete

llvm/test/MC/X86/x86-64-align-branch-1e.s
46 ↗(On Diff #232116)

This is weird. Comparing this with 1d, 1e allows more instruction types, yet it inserts two NOPs which actually seems to degrade performance.

llvm/test/MC/X86/x86-64-align-branch-1f.s
8 ↗(On Diff #232116)

No disassembly is needed. Just check that --x86-align-branch-boundary=0 and the default (no x86- specific options) have the identical output (cmp %t %t2)

llvm/test/MC/X86/x86-64-align-branch-1g.s
1 ↗(On Diff #232116)

Merge 1e and 1g. State that -mcpu=x86-64 generates 66 90 instead of 90 90 (but why?)

3 ↗(On Diff #232116)

Delete

7 ↗(On Diff #232116)

Delete Disassembly of section .text:

I find another deficiency (infinite loop) with the current approach.

Say, there is a je 0 (0x0F 0x84 0x00 0x00 0x00 0x00) at byte 0x90. (0x90+6)%32 == 0, so it ends on a 32-byte boundary.
MF.getMaxPrefixSize() is 4, so the size of MCMachineDependentFragment may vary from 0 to 4.
If there are other MCMachineDependentFragment's in the section, some may shrink while some may expand.
In some cases the following loop will not converge

bool MCAssembler::layoutOnce(MCAsmLayout &Layout) {
  ++stats::RelaxationSteps;

  bool WasRelaxed = false;
  for (iterator it = begin(), ie = end(); it != ie; ++it) {
    MCSection &Sec = *it;
    while (layoutSectionOnce(Layout, Sec)) ///
      WasRelaxed = true;
  }

  return WasRelaxed;
}

// In MCAssembler::layoutSectionOnce,
  case MCFragment::FT_MachineDependent:
    RelaxedFrag =
        relaxMachineDependent(Layout, *cast<MCMachineDependentFragment>(I));
    break;

To give a concrete example, clang++ -fsanitize=memory compiler-rt/test/msan/cxa_atexit.cpp -mbranches-within-32B-boundaries does not converge. You may also try dtor-*.cpp in that directory.

A simple iterative algorithm is not guaranteed to converge. We probably can solve the layout problem with a dynamic programming algorithm:

f[i][j] = the minimum cost that layouts the first i instructions with j extra bytes (via NOPs or prefixes)
or
g[i][j] = the minimum inserted bytes that layouts the first i instructions with cost j

I am not clear which one is better. A simple greedy approach is to set an upper limit on the number of iterations when the section contains at least one MCMachineDependentFragment, i.e.

bool HasMCMachineDependentFragment = false;
int count = 5; // arbitrary. Please find an appropriate value.
while (layoutSectionOnce(Layout, Sec, HasMCMachineDependentFragment) && count > 0) {
  WasRelaxed = true;
  if (HasMCMachineDependentFragment)
    count--;
}
MaskRay added inline comments.Dec 4 2019, 5:59 PM
llvm/lib/MC/MCAssembler.cpp
986

Division is slow. Pass in the power of 2 and use right shift instead.

You may change MCMachineDependentFragment::AlignBoundarySize (a power of 2) to a power.

994

Ditto.

skan updated this revision to Diff 232326.Dec 5 2019, 5:52 AM
skan marked 5 inline comments as done.Dec 5 2019, 5:56 AM
skan added a comment.Dec 5 2019, 7:41 AM

I find another deficiency (infinite loop) with the current approach.

Say, there is a je 0 (0x0F 0x84 0x00 0x00 0x00 0x00) at byte 0x90. (0x90+6)%32 == 0, so it ends on a 32-byte boundary.
MF.getMaxPrefixSize() is 4, so the size of MCMachineDependentFragment may vary from 0 to 4.
If there are other MCMachineDependentFragment's in the section, some may shrink while some may expand.
In some cases the following loop will not converge

bool MCAssembler::layoutOnce(MCAsmLayout &Layout) {
  ++stats::RelaxationSteps;

  bool WasRelaxed = false;
  for (iterator it = begin(), ie = end(); it != ie; ++it) {
    MCSection &Sec = *it;
    while (layoutSectionOnce(Layout, Sec)) ///
      WasRelaxed = true;
  }

  return WasRelaxed;
}

// In MCAssembler::layoutSectionOnce,
  case MCFragment::FT_MachineDependent:
    RelaxedFrag =
        relaxMachineDependent(Layout, *cast<MCMachineDependentFragment>(I));
    break;

To give a concrete example, clang++ -fsanitize=memory compiler-rt/test/msan/cxa_atexit.cpp -mbranches-within-32B-boundaries does not converge. You may also try dtor-*.cpp in that directory.

A simple iterative algorithm is not guaranteed to converge. We probably can solve the layout problem with a dynamic programming algorithm:

f[i][j] = the minimum cost that layouts the first i instructions with j extra bytes (via NOPs or prefixes)
or
g[i][j] = the minimum inserted bytes that layouts the first i instructions with cost j

I am not clear which one is better. A simple greedy approach is to set an upper limit on the number of iterations when the section contains at least one MCMachineDependentFragment, i.e.

bool HasMCMachineDependentFragment = false;
int count = 5; // arbitrary. Please find an appropriate value.
while (layoutSectionOnce(Layout, Sec, HasMCMachineDependentFragment) && count > 0) {
  WasRelaxed = true;
  if (HasMCMachineDependentFragment)
    count--;
}

I guess you originally wanted to say (90 + 6)%32 == 0 instead of (0x90 +6)%32 == 0. With the last patch, the command

clang++ -fsanitize=memory compiler-rt/test/msan/cxa_atexit.cpp -mbranches-within-32B-boundaries does not converge, but this is not the fault of this simple iterative algorithm.

Let me briefly introduce this algorithm. Regardless of whether the prefix or nop is filled in, MCMachineDependentFragment is used to occupy a certain size of space to align the branch, and it has a pointer to the branch it is responsible for. Multiple MCMachineDependentFragments can work together to align a branch. For example, there are two MCMachineDependentFragment, M1 and M2 to align branch J1. In the first iteration, J1 needs 7 bytes of padding, so the size of M1 will grow to 7 as much as possible. However, M1 can only grow to 5 or smaller from some reason. At this time, M2 will grow as much as possible to 2. In the second iteration, when it's M1's turn to relax, we will subtract the size of M1 and M2 from the current address of J1 to get the size that J1 needs to padding, assuming it is 6, and then M1 will keep the size to 5 and the size of M2 will be reduced to 1. It is not difficult to see that among the MCFragment from M1 to J1, as long as the size of MCFragment other than M1 and M2 is fixed within a limited number of iterations, the sizes of M1 and M2 will be fixed within a limited number of iterations, that is, the iteration will converge.

As far as I know, the MCFragment used to store instructions includes MCDataFragment, MCRelaxableFragment and MCCompactEncodedFragment. MCDataFragment and MCCompactEncodedFragment have fixed sizes. MCRelaxableFragment is used to store instructions of variable size and will only grow and not shrink, as a result, its size will be fixed within a limited number of cycles . So as long as we ensure that there is only MCFragment that stores instructions between the instruction to be prefixed and the branch to be aligned, this iterative algorithm will converge.

The file cxa_atexit.s has more than one text section. The instruction to be prefixed and the branch to align may be in two different text sections. I forgot to check if there is a MCFragment not storing instruction between the them, which results in non-convergence. This bug has been fixed in the latest patch.

skan marked 3 inline comments as done.Dec 5 2019, 8:01 AM
skan added inline comments.
llvm/test/MC/X86/x86-64-align-branch-1b.s
10 ↗(On Diff #232116)

Thanks for the knowledge! There is no performance benefit to add 4 prefixes to the same instruction. The instruction movl %eax, %fs:1 has already has a prefix %fs (0x64), if option -x86-align-branch-prefix-size=5 is used, we could add 4 more prefixes at most. The branch to be aligned needs 3-byte padding, so 3 prefixes are added to the move instruction.

llvm/test/MC/X86/x86-64-align-branch-1c.s
2 ↗(On Diff #232116)
# CHECK-NEXT:       45: 2e 89 45 fc                      movl    %eax, %cs:-4(%rbp)
# CHECK-NEXT:       49: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       4c: 89 7d f8                         movl    %edi, -8(%rbp)
# CHECK-NEXT:       4f: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       52: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       55: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       58: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       5b: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       5e: 5d                               popq    %rbp
# CHECK-NEXT:       5f: 5d                               popq    %rbp
# CHECK-NEXT:       60: eb 26                            jmp     {{.*}}

The prefix 2e is added at

45: 2e 89 45 fc                      movl    %eax, %cs:-4(%rbp)
llvm/test/MC/X86/x86-64-align-branch-1e.s
46 ↗(On Diff #232116)

Not all the target cpu support long nop, you can find related code in X86AsmBackend::writeNopData.

FYI: I did close the bug as fixed after verifying that the fix works for me.

reames added a comment.Dec 5 2019, 2:38 PM

We uncovered another functional issue with this patch, or at least, the interaction of this patch and other parts of LLVM. In our support for STATEPOINT, PATCHPOINT, and STACKMAP we use N-byte nop sequences for regions of code which might be patched out. It's important that these regions are exactly N bytes as concurrent patching which doesn't replace an integral number of instructions is ill-defined on X86-64. This patch causes the N-byte nop sequence to sometimes become (N+M) bytes which breaks the patching. I believe that the XRAY support may have a similar issue.

More generally, I'm worried about the legality of arbitrarily prefixing instructions from unknown sources. In the particular example we saw, we had something along the following:

.Ltmp0:
.p2align 3, 0x90
(16 byte nop sequence)
.Ltmp3:
jmp *%rax

In addition to the patching legality issue above, padding the nop sequence does something else interesting in this example. It changes the alignment of Ltmp3. Before, Ltmp3 was always 8 byte aligned, after prefixes are added, it's not. It's not clear to me exactly what the required semantics here are, but we at least had been assuming the alignment of Ltmp3 was guaranteed in this case. (That's actually how we found the patching issue.)

reames added a comment.Dec 5 2019, 3:14 PM

I've been digging through the code for this for the last day or so. This is a new area for me, so it's possible I'm off base, but I have some concerns about the current design.

First, there appears to already be support for instruction bundling and alignment in the assembler today. I stumbled across the .bundle_align_mode, .bundle_start, and .bundle_end mechanism (https://lists.llvm.org/pipermail/llvm-dev/2012-December/056723.html) which seems to *heavily* overlap with this proposal. I suspect that the compiler support suggested by James and myself earlier in this thread could be implemented on to this existing mechanism.

Second, the new callbacks and infrastructure added appear to overlap heavily w/the MCCodePadding infrastructure. (Which, admitted, appears unused and untested.)

Third, I have not see a justification for why complexity for instruction prefix padding is necessary. All the effected CPUs support multi-byte nops, so we're talking about a *single micro op* difference between the nop form and prefix form. Can anyone point to a performance delta due to this? If not, I'd suggest we should start with the nop form, and then build the prefix form in a generic manner for all alignment varieties.

I've been digging through the code for this for the last day or so. This is a new area for me, so it's possible I'm off base, but I have some concerns about the current design.

First, there appears to already be support for instruction bundling and alignment in the assembler today. I stumbled across the .bundle_align_mode, .bundle_start, and .bundle_end mechanism (https://lists.llvm.org/pipermail/llvm-dev/2012-December/056723.html) which seems to *heavily* overlap with this proposal. I suspect that the compiler support suggested by James and myself earlier in this thread could be implemented on to this existing mechanism.

Second, the new callbacks and infrastructure added appear to overlap heavily w/the MCCodePadding infrastructure. (Which, admitted, appears unused and untested.)

My conclusion after looking at all of that was actually that I plan to propose removing both the MCCodePadding and all the bundle-padding infrastructure, not add new stuff on top of it -- the former is unused, and I believe the latter is only for Chrome's NaCL, which is deprecated, and fairly close to being removed. If we need something similar in the future, we should certainly look to both of those for inspiration, but I don't think we need to be constrained by them.

Third, I have not see a justification for why complexity for instruction prefix padding is necessary. All the effected CPUs support multi-byte nops, so we're talking about a *single micro op* difference between the nop form and prefix form. Can anyone point to a performance delta due to this? If not, I'd suggest we should start with the nop form, and then build the prefix form in a generic manner for all alignment varieties.

+1.

MaskRay added a subscriber: opaparo.Dec 5 2019, 3:41 PM

I've been digging through the code for this for the last day or so. This is a new area for me, so it's possible I'm off base, but I have some concerns about the current design.

First, there appears to already be support for instruction bundling and alignment in the assembler today. I stumbled across the .bundle_align_mode, .bundle_start, and .bundle_end mechanism (https://lists.llvm.org/pipermail/llvm-dev/2012-December/056723.html) which seems to *heavily* overlap with this proposal. I suspect that the compiler support suggested by James and myself earlier in this thread could be implemented on to this existing mechanism.

Second, the new callbacks and infrastructure added appear to overlap heavily w/the MCCodePadding infrastructure. (Which, admitted, appears unused and untested.)

My conclusion after looking at all of that was actually that I plan to propose removing both the MCCodePadding and all the bundle-padding infrastructure, not add new stuff on top of it -- the former is unused, and I believe the latter is only for Chrome's NaCL, which is deprecated, and fairly close to being removed. If we need something similar in the future, we should certainly look to both of those for inspiration, but I don't think we need to be constrained by them.

CC the author of D34393 - @opaparo for MCCodePadding. Intel folks may know how to contact @opaparo?

I also noticed that MCCodePadder.cpp is never updated (except a license change) after the initial check-in.

Third, I have not see a justification for why complexity for instruction prefix padding is necessary. All the effected CPUs support multi-byte nops, so we're talking about a *single micro op* difference between the nop form and prefix form. Can anyone point to a performance delta due to this? If not, I'd suggest we should start with the nop form, and then build the prefix form in a generic manner for all alignment varieties.

+1.

+1. Starting from just NOP padding sounds a simple and good first step. We can explore segment override prefixes in the future.

reames added a comment.Dec 5 2019, 5:43 PM

I've been digging through the code for this for the last day or so. This is a new area for me, so it's possible I'm off base, but I have some concerns about the current design.

First, there appears to already be support for instruction bundling and alignment in the assembler today. I stumbled across the .bundle_align_mode, .bundle_start, and .bundle_end mechanism (https://lists.llvm.org/pipermail/llvm-dev/2012-December/056723.html) which seems to *heavily* overlap with this proposal. I suspect that the compiler support suggested by James and myself earlier in this thread could be implemented on to this existing mechanism.

Second, the new callbacks and infrastructure added appear to overlap heavily w/the MCCodePadding infrastructure. (Which, admitted, appears unused and untested.)

My conclusion after looking at all of that was actually that I plan to propose removing both the MCCodePadding and all the bundle-padding infrastructure, not add new stuff on top of it -- the former is unused, and I believe the latter is only for Chrome's NaCL, which is deprecated, and fairly close to being removed. If we need something similar in the future, we should certainly look to both of those for inspiration, but I don't think we need to be constrained by them.

I can definitely see removing the code padding stuff, since it's unused and untested.

As for the bundle mechanisms, why? It seems like exactly what we're going to want here. Regardless of the auto-detect feature, we're going to need a representation of a bundle which needs to be properly placed to avoid splitting, and the current code does that. Why not reuse the, presumable reasonable well tested, existing infrastructure? The only extra thing we seem to need is the ability to toggle off bundle formation for instruction types we don't care about. Since we're going to need an assembler spelling of that regardless, it seems like the current code is a decent baseline?

I've been digging through the code for this for the last day or so. This is a new area for me, so it's possible I'm off base, but I have some concerns about the current design.

First, there appears to already be support for instruction bundling and alignment in the assembler today. I stumbled across the .bundle_align_mode, .bundle_start, and .bundle_end mechanism (https://lists.llvm.org/pipermail/llvm-dev/2012-December/056723.html) which seems to *heavily* overlap with this proposal. I suspect that the compiler support suggested by James and myself earlier in this thread could be implemented on to this existing mechanism.

Second, the new callbacks and infrastructure added appear to overlap heavily w/the MCCodePadding infrastructure. (Which, admitted, appears unused and untested.)

My conclusion after looking at all of that was actually that I plan to propose removing both the MCCodePadding and all the bundle-padding infrastructure, not add new stuff on top of it -- the former is unused, and I believe the latter is only for Chrome's NaCL, which is deprecated, and fairly close to being removed. If we need something similar in the future, we should certainly look to both of those for inspiration, but I don't think we need to be constrained by them.

I can definitely see removing the code padding stuff, since it's unused and untested.

As for the bundle mechanisms, why? It seems like exactly what we're going to want here. Regardless of the auto-detect feature, we're going to need a representation of a bundle which needs to be properly placed to avoid splitting, and the current code does that. Why not reuse the, presumable reasonable well tested, existing infrastructure? The only extra thing we seem to need is the ability to toggle off bundle formation for instruction types we don't care about. Since we're going to need an assembler spelling of that regardless, it seems like the current code is a decent baseline?

I created D71106 to delete MCCodePadder and accompanying classes.

Third, I have not see a justification for why complexity for instruction prefix padding is necessary. All the effected CPUs support multi-byte nops, so we're talking about a *single micro op* difference between the nop form and prefix form. Can anyone point to a performance delta due to this? If not, I'd suggest we should start with the nop form, and then build the prefix form in a generic manner for all alignment varieties.

+1.

+1. Starting from just NOP padding sounds a simple and good first step. We can explore segment override prefixes in the future.

I think it's a good suggestion to start with NOP padding as the first step. In our previous experiment, we saw that the prefix padding was slight better than NOP padding, but not much. We will retest the NOP padding and go back to you.

Third, I have not see a justification for why complexity for instruction prefix padding is necessary. All the effected CPUs support multi-byte nops, so we're talking about a *single micro op* difference between the nop form and prefix form. Can anyone point to a performance delta due to this? If not, I'd suggest we should start with the nop form, and then build the prefix form in a generic manner for all alignment varieties.

+1.

+1. Starting from just NOP padding sounds a simple and good first step. We can explore segment override prefixes in the future.

I think it's a good suggestion to start with NOP padding as the first step. In our previous experiment, we saw that the prefix padding was slight better than NOP padding, but not much. We will retest the NOP padding and go back to you.

For whatever it may be worth: Agnor Fog's empirical research on x86 pipelines and his review of manufacturer optimization guidelines also concludes that prefixes are often preferable to NOPs on modern x86 processors. (See: https://www.agner.org/optimize/microarchitecture.pdf) This arguably isn't surprising given that the decoder needs to be good at finding instruction boundaries but the decoder isn't responsible for interpreting instructions, therefore NOPs of any size dilute decode bandwidth.

Recording something so I don't forget it when we get back to the prefix padding version. The write up on the bundle align mode stuff mentions a concerning memory overhead for the feature. Since the basic implementation techniques are similar, we need to make sure we assess the memory overhead of the prefix padding implementation. See https://www.chromium.org/nativeclient/pnacl/aligned-bundling-support-in-llvm for context. I don't believe this is likely to be an issue for the nop padding variant.

Recording something so I don't forget it when we get back to the prefix padding version. The write up on the bundle align mode stuff mentions a concerning memory overhead for the feature. Since the basic implementation techniques are similar, we need to make sure we assess the memory overhead of the prefix padding implementation. See https://www.chromium.org/nativeclient/pnacl/aligned-bundling-support-in-llvm for context. I don't believe this is likely to be an issue for the nop padding variant.

From the doc of https://www.chromium.org/nativeclient/pnacl/aligned-bundling-support-in-llvm the ".bundle_align_mode" ensure each instruction that following the directive don't cross the alignment boundary and ensure the first instruction following the directive be aligned, but in this patch we only require branch (jcc, jmp, ...) instruction don't cross the alignment boundary. Another remind, this patch avoid branch instruction hit the alignment boundary, and bundle syntax doesn't support that (see section 2.1 of https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf). So I don't think bundle syntax fit current requirement perfectly.

Branch alignment

The primary goal of this patch, restricting the placement of branch instructions, is a performance optimization. Similar to loop alignment, the desire is to increase speed, at the cost of code-size. However, the way this feature has been designed is a global assembler flag. I find that not ideal, because it cannot take into account hotness of a block/function, as for example loop alignment code does. Basic-block alignment of loops is explicitly opt-in on an block-by-block basis -- the compiler simply emits a p2align directive where it needs, and the assembler honors that. And so, MachineBlockPlacement::alignBlocks has a bunch of conditions under which it will avoid emitting a p2align. This seems like a good model -- the assembler does what it's told by the compiler (or assembly-writer). Making the branch-instruction-alignment work similarly seems like it would be good.

IMO it would be nicest if there could be a directive that requests to specially-align the next instruction. However, the fused-jcc case makes that quite tricky, so perhaps this ought to also be a mode which can be enabled/disabled on a region as well.

Yes, the primary goal of this patch is a performance optimization or mitigation. The intention is to provide a simple method for users to mitigate the performance impact of JCC MCU with less effort. We also provide users several options to tune the performance. But the basic idea is to make it easy for users to mitigate it and improve the performance.

Your proposal is a good idea. But I'm afraid it may not cover all the scenarios. Firstly, the proposal replies on compiler to detect the hotspots. But the compiler needs LTO and/or PGO to get the precise hot spots. Otherwise, if the compiler misses the hot spots which impact the application performance, the users have no way but have to insert the directives manually. Secondly, for the existing codes written in assembly, the compiler can't handle it. The users have to insert the directives by hand, which are pretty much work.

I think the current patch wants to give a simple and general solution to mitigate JCC MCU performance impact to both C/C++ and Assembly. And it doesn't need any source code change. I think your proposal will be a good enhancement on top of it.

skan added a comment.Dec 8 2019, 11:43 PM

We uncovered another functional issue with this patch, or at least, the interaction of this patch and other parts of LLVM. In our support for STATEPOINT, PATCHPOINT, and STACKMAP we use N-byte nop sequences for regions of code which might be patched out. It's important that these regions are exactly N bytes as concurrent patching which doesn't replace an integral number of instructions is ill-defined on X86-64. This patch causes the N-byte nop sequence to sometimes become (N+M) bytes which breaks the patching. I believe that the XRAY support may have a similar issue.

More generally, I'm worried about the legality of arbitrarily prefixing instructions from unknown sources. In the particular example we saw, we had something along the following:

.Ltmp0:
.p2align 3, 0x90
(16 byte nop sequence)
.Ltmp3:
jmp *%rax

In addition to the patching legality issue above, padding the nop sequence does something else interesting in this example. It changes the alignment of Ltmp3. Before, Ltmp3 was always 8 byte aligned, after prefixes are added, it's not. It's not clear to me exactly what the required semantics here are, but we at least had been assuming the alignment of Ltmp3 was guaranteed in this case. (That's actually how we found the patching issue.)

I could not reproduce the phenomenon that N-byte nop becomes (N+M) bytes with your example. So according to my understanding, I slightly modified your case. (If my understand is wrong, I hope you can point it out :-). )

    .text
    nop
.Ltmp0:
    .p2align 3, 0x90
    .rept 16
    nop
    .endr
.Ltmp3:
    movl  %eax, -4(%rsp)
    .rept 2
    nop
    .endr
    jmp .Ltmp0

The instruction jmp .Ltmp0 starts at byte 0x1e and ends at byte 0x20. If we align the jump with preifx, two prefixes will be added to the .rept2 16 nop .endr. After prefixes are added, the 16-byte nop becomes 18-byte nop, then the label '.Ltmp3' is not 8-byte aligned any more.

I doubt whether the assumption that '.Ltmp3' is 8-byte aligned is right, since the alignment is not explicitly required. For example, I think we can not assuming the instruction jmp .Ltmp0 always starts at byte 0x1e. And in fact, as long as the binary generated by the compiler does not have a one-to-one correspondence with the assembly code, at least one implict assumption of an instruction will be broken.

I could not reproduce the phenomenon that N-byte nop becomes (N+M) bytes with your example. So according to my understanding, I slightly modified your case. (If my understand is wrong, I hope you can point it out :-). )

    .text
    nop
.Ltmp0:
    .p2align 3, 0x90
    .rept 16
    nop
    .endr
.Ltmp3:
    movl  %eax, -4(%rsp)

In our case it was

andl $1, %eax

but it does not matter that much.

.rept 2
nop
.endr
jmp .Ltmp0
The instruction `jmp .Ltmp0` starts at byte 0x1e and ends at byte 0x20.

Again, in our particular case start of the sequence was at xxx8, so 8 + 16(our sequence) + 3(andl) + 5(jmp) == 32.

If we align the jump with preifx, two prefixes will be added to the .rept2 16 nop .endr. After prefixes are added, the 16-byte nop becomes 18-byte nop, then the label '.Ltmp3' is not 8-byte aligned any more.

Yes, thats what happened.

I doubt whether the assumption that '.Ltmp3' is 8-byte aligned is right, since the alignment is not explicitly required.

The point is that we have explicit requirement at the start and we have a lowering into 16-byte sequence that we need to be preserved exactly as it is.
Essentially what we need is a "protection" for this sequence from any changes by machinery that generates the binary code.
How can we protect a particular byte sequence from being changed by this branch aligner?

annita.zhang added a comment.EditedDec 9 2019, 5:14 AM

The point is that we have explicit requirement at the start and we have a lowering into 16-byte sequence that we need to be preserved exactly as it is.
Essentially what we need is a "protection" for this sequence from any changes by machinery that generates the binary code.
How can we protect a particular byte sequence from being changed by this branch aligner?

No, in general we can't. The current solution is based on assembler to insert prefix or nop before the cross (or against) boundary branches. It can only ensure the explicit alignment specified by directive, but not any implicit alignment. I don't think any fixup based on assembler can do it. On the other hand, any code sequence after the alignment directive or even just in a function has some kind of implicit alignment. It's hard for assembler to tell which implicit alignment to preserve. The preferred way is to use explicit alignment directive to specify it.

For your scenario, a NOP padding is more controllable. NOP padding will be inserted just before the branch instructions (or macro fusion branch instructions). So if there's no branches (or macro fusion branches) in your code sequence, there will be no NOP inserted.

The point is that we have explicit requirement at the start and we have a lowering into 16-byte sequence that we need to be preserved exactly as it is.
Essentially what we need is a "protection" for this sequence from any changes by machinery that generates the binary code.
How can we protect a particular byte sequence from being changed by this branch aligner?

No, in general we can't. The current solution is based on assembler to insert prefix or nop before the cross (or against) boundary branches. It can only ensure the explicit alignment specified by directive, but not any implicit alignment. I don't think any fixup based on assembler can do it. On the other hand, any code sequence after the alignment directive or even just in a function has some kind of implicit alignment. It's hard for assembler to tell which implicit alignment to preserve. The preferred way is to use explicit alignment directive to specify it.

For your scenario, a NOP padding is more controllable. NOP padding will be inserted just before the branch instructions (or macro fusion branch instructions). So if there's no branches (or macro fusion branches) in your code sequence, there will be no NOP inserted.

What if I insert explicit align(8) right *after* the sequence?

reames added a comment.Dec 9 2019, 5:17 PM

I just posted an alternate review (https://reviews.llvm.org/D71238) which attempts to carve out a minimum reviewable piece of complexity. The hope is that we can review that one quickly (as there are fewer interacting concerns), and then rebase this one (possibly splitting further).

I had previously suggested in review comments that we should reuse the infrastructure from .bundle_align_mode. When I sat down to actually implement that, I discovered that the code for that has a bunch of interacting assumptions about when fragments are constructed and used vs alignment boundaries. I got a version of this working, but the complexity was worrisome. I now suggest that we should take the rough approach sketched here (a separate fragment before the one being aligned), delete the essentially unused bundle mode code, and revisit a unified representation if needed for memory density at a later time. (i.e. my previous suggestion wasn't a good one)

skan added a comment.Dec 9 2019, 5:27 PM

The point is that we have explicit requirement at the start and we have a lowering into 16-byte sequence that we need to be preserved exactly as it is.
Essentially what we need is a "protection" for this sequence from any changes by machinery that generates the binary code.
How can we protect a particular byte sequence from being changed by this branch aligner?

No, in general we can't. The current solution is based on assembler to insert prefix or nop before the cross (or against) boundary branches. It can only ensure the explicit alignment specified by directive, but not any implicit alignment. I don't think any fixup based on assembler can do it. On the other hand, any code sequence after the alignment directive or even just in a function has some kind of implicit alignment. It's hard for assembler to tell which implicit alignment to preserve. The preferred way is to use explicit alignment directive to specify it.

For your scenario, a NOP padding is more controllable. NOP padding will be inserted just before the branch instructions (or macro fusion branch instructions). So if there's no branches (or macro fusion branches) in your code sequence, there will be no NOP inserted.

What if I insert explicit align(8) right *after* the sequence?

If your insert explicit .align 8 after the sequence, and the sequence doesn't has any branch to be aligned, the current solution won't change the sequence.

skan updated this revision to Diff 233008.Dec 9 2019, 11:48 PM

What if I insert explicit align(8) right *after* the sequence?

If your insert explicit .align 8 after the sequence, and the sequence doesn't has any branch to be aligned, the current solution won't change the sequence.

Well, I kinda figure that from the code behavior right now, however is it really by design or just happens to work now?
Seeing that assembler becomes very intelligent now I would rather have a strict guarantee similar to "hardcode" thing that you have here that protects my sequence
rather than relying on a fact that in current settings moving my label by 8 does not appear to be profitable to assembler.

skan added a comment.Dec 11 2019, 10:40 PM

What if I insert explicit align(8) right *after* the sequence?

If your insert explicit .align 8 after the sequence, and the sequence doesn't has any branch to be aligned, the current solution won't change the sequence.

Well, I kinda figure that from the code behavior right now, however is it really by design or just happens to work now?

It is by design, prefix won't be added to instruction if there is a align directive int the path to the target branch.

Seeing that assembler becomes very intelligent now I would rather have a strict guarantee similar to "hardcode" thing that you have here that protects my sequence
rather than relying on a fact that in current settings moving my label by 8 does not appear to be profitable to assembler.

It depends on what your real need is. If you just need your original implicit alignment to work, turning it into explicit alignment is enough. Or if you need a sequence not changed by
the assembler at all, maybe we need a new directive such as ".hard " or ".bundle".

skan updated this revision to Diff 233545.Dec 12 2019, 2:45 AM
  1. Replace divide and mod operation with bitwise operation
  2. Refine help information

I'm seeing lots of updates to fix bugs, but no movement for many days on both my meta comments and (in some ways more importantly) James's meta comments. (And thanks Philip for chiming in too!)

Meanwhile, we really, really need to get this functionality in place. The entire story for minimizing the new microcode performance hit hinges on these patches, and I'm really worried by how little progress we're seeing here.

Sorry for belated response. We're working hard to go through some paper work to get the performance data ready. I think maybe it's better to open a mailing thread in llvm-dev to post those performance data and discuss those suggestions.

The first data was posted in http://lists.llvm.org/pipermail/llvm-dev/2019-December/137413.html.

Thanks,
Annita

More performance data was posted on http://lists.llvm.org/pipermail/llvm-dev/2019-December/137609.html and http://lists.llvm.org/pipermail/llvm-dev/2019-December/137610.html. Let's move on based on the data.

skan updated this revision to Diff 234041.Dec 16 2019, 5:59 AM
  1. rename getSegmentPrefixSize(const MCInst &MI) to countSegmentPrefix(const MCInst &MI).
  2. remove unnecessary data member of X86AsmBackend std::vector<MCMachineDependentFragment *> PendingAlignmentFragments.
  3. Make the message of assert in MCMachineDependentFragment more clear.

More performance data was posted on http://lists.llvm.org/pipermail/llvm-dev/2019-December/137609.html and http://lists.llvm.org/pipermail/llvm-dev/2019-December/137610.html. Let's move on based on the data.

Based on the SPEC data, we observed 2.6% and 1.3% performance effect in INTRATE and FPRATE geomean respectively. Performance effect on individual components were observed up to 5.1%.

The tool SW mitigation can recover the geomean to within 99% of the original performance with prefix padding to jcc+jmp+fused. The maximum performance loss was reduced to within 2.2% of the original one.

The prefix padding can provide better performance as 0.3%~0.5% in geomean than nop padding on system with micro update. In individual cases, we observed up to 1.4% performance improvement in prefix padding. On a system w/o micro update, we observed 0.7% better performance of prefix padding on INTRATE geomean.

In this SPEC test, the prefix padding to jcc+jmp+fused and prefix padding to all branches has almost the same performance. However, we observed the latter prefix padding had a little bit better performance than the previous one in some cases at the cost of code size.

Since the performance delta in prefix padding and nop padding is incremental, starting from nop padding may be easier to implement as a first step, with additional prefix padding options to explore for additional performance optimizations.

And since the current update enables a relatively simple and general solution to mitigate JCC microcode update (MCU) performance effects to both C/C++ and Assembly without any source code change, we recommend it as a starting point. Then, the proposal for compiler to generate directives could enable further enhancements.

In offline discussion, there was an agreement that we needed further coordination to make sure this patch moves forward quickly. For that reason, there will be a call happening today at 4pm Pacific. Interested parties are welcome to attend.

Zoom Meeting ID: 507-497-8898
https://azul.zoom.us/j/5074978898

Results of the meeting will be summarized and posted here.

Haven't looked into too many details yet but made some suggestions anyway...

llvm/include/llvm/MC/MCFragment.h
634

Store AlignBoundarySize as a shift value, then needPadding doesn't even need to call Log2_64().

llvm/lib/MC/MCFragment.cpp
426

const auto *. The type is obvious according to the right hand side.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
104

An unknown value is just ignored, e.g. --x86-align-branch=unknown. I think there should be an error, but I haven't looked into the patch detail to confidently suggest how we should surface this error.

185

} else {

Should --x86-branches-within-32B-boundaries overwrite --x86-align-branch-boundary and --x86-align-branch and --x86-align-branch-prefix-size? My feeling is that it just provides a default value if either of the three options is not specified.

If you are going to remove addKind calls here, you can delete this member function.

196

space after if

No curly braces around simple statements.

660

isa<MCMachineDependentFragment>(F)

llvm/lib/Target/X86/X86InstrInfo.td
1020

Unintentional change not part of this patch?

llvm/test/MC/X86/align-branch-32-1a.s
1

Did an older version include 32-1b.s or 32-1c.s? Now they are missing.

llvm/test/MC/X86/align-branch-64-1b.s
1

Create test/MC/X86/Inputs/align-branch-64-1.s and reference it from 1[a-d].s via %S/Inputs/align-branch-64-1.s

Noting another issue we found in local testing (with an older version of this patch). This interacts badly with the implicit exception mechanism in LLVM. For that mechanism, we end up generating assembly which looks more or less like this:
Ltmp:

cmp %rsi, (%rdi)
jcc <target>

And a side table which maps TLmp to another label so that a fault at Ltmp can be interpreted as an extremely expensive branch via signal handler.

The problem is that the auto-alignment of the fused branch causes padding to be introduced which separate the label and the faulting exception, breaking the mapping.

Essentially, this comes down to an implicit assumption that the label stays tightly bundled with the following instruction.

This can happen with either nop or prefix padding.

Just wanted to say thanks for the performance data! I know it was hard to get, but it is really, really useful to help folks evaluate these kinds of changes with actual data around the options available.

skan marked an inline comment as done.Dec 16 2019, 5:17 PM
skan added inline comments.
llvm/lib/MC/MCFragment.cpp
426

Shall we keep consistent with the local code style? const MCLEBFragment *LF = cast<MCLEBFragment>(this); was used here.

Here are the minutes from our phone call a few minutes ago.

Attendees: Andy Kaylor, Craig Topper, Annita Zhang, Tom Stellard, Chandler Carruth, Fedor Sergeev, Philip Reames, Yuanake Luo

Status Summary

Performance data has been posted to llvm-dev. We had a side discussion about nop encoding, and it was mentioned these numbers were collected from runs targeting skylake (i.e. not generic x86). This is similar to the result we (Azul) have collected and shared summaries of previously.

GNU patch has landed Friday - this mostly fixes assembler command line.

Discussion on Approach

Three major options debated:

Assembler only - as in the current patch, assembler does all work, only command line flag
Explicit Directive only - as proposed in my alternate patch, compiler decides exactly what instructions get aligned
Region based directives - as proposed in James' last comment on review, directives enable and disable auto-padding in assembler

Use cases identified:

compiler users
    important than assembler is self contained (i.e. don't have to know compiler options for reproduceability)
    inline assembly looks a lot like assembler users
legacy assembler
    important that existing assembly works unmodified
assembler users
    "try it and see" model vs selective enable vs selective disable
    likely need to support all three

Consensus was that the region based directives met use cases the best. In particular, desire to be able to overrule default (for say, inline assembly or a JITs patchability assumptions) and then restore default. Default assembler behavior remains unchanged.

Stawman syntax proposal

.align_branch_boundary disable/default
.align_branch_boundary enable N, instructions (fused, jcc, jmp, etc..)

We need to ensure a consensus on syntax is shared w/gnu. Annita agreed to coordinate this.

Compiler would essentially just wrap generated assembly in directives.

Issue noticed while writing this up: proposed syntax assumes a default has been set, but doesn't give a way to set one. This would seem to break the desired reproducibility property for compiled code. Revision needed.

Push/Pop semantics were suggested at one point, but were thought to be non-idiomatic?

Other Topics

We very quickly discussed nop vs prefix performance. There was a clear consensus in starting with nop only patch and evolving later as needed.

Next Steps

Annita will refresh current patch with two key changes. 1) Drop prefix support and simplify and 2) drop clang driver support for now. Desire is to minimize cycle time before next iteration so that feedback on approach can be given while reviewers are still around.

Philip will prototype directive parsing support. Annita and Yuo (??) to handle coordination on syntax.

Suggested patch split:

(current patch) command line option to set default, nop only version w/cleaned up code as much as possible
assembler directive support (draft by Philip in parallel)
(future) compiler patch to wrap by default

Side note to Annita: For you to remove "hard code", you'll have to have a placeholder for the enable/disable interface. That should probably be split and rebased in my patch.

Here are the minutes from our phone call a few minutes ago.

Attendees: Andy Kaylor, Craig Topper, Annita Zhang, Tom Stellard, Chandler Carruth, Fedor Sergeev, Philip Reames, Yuanake Luo

Thanks for organization the meeting and making the summary.

Stawman syntax proposal

.align_branch_boundary disable/default
.align_branch_boundary enable N, instructions (fused, jcc, jmp, etc..)

...

Push/Pop semantics were suggested at one point, but were thought to be non-idiomatic?

There is a precedant: .pushsection/.popsection (MCStreamer::SectionStack). With .push_align_branch/.pop_align_branch, we probably don't need the 'switch-to-default' action.

I don't know how likely we may ever need nested states (e.g. an .include directive inside an .align_branch region where the included file has own idea about branch alignment), but .push/.pop does not seem to be more complex than disable/enable/default.

I confirm that the following 4 commits have been to the binutils-gdb repository (https://sourceware.org/ml/binutils/2019-12/msg00138.html https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=e379e5f385f874adb0b414f917adb1fc50e20de9).

gas: Add md_generic_table_relax_frag
i386: Align branches within a fixed boundary
i386: Add -mbranches-within-32B-boundaries
i386: Add tests for -malign-branch-boundary and -malign-branch

Here are the minutes from our phone call a few minutes ago.

Thanks for coordinating the meeting and having a clear summary. It helps a lot to accelerate the patch review. I really appreciate it!

Annita will refresh current patch with two key changes. 1) Drop prefix support and simplify and 2) drop clang driver support for now. Desire is to minimize cycle time before next iteration so that feedback on approach can be given while reviewers are still around.

Yes, we are working on it right now. Hopefully we can submit a new patch today or tomorrow.

Philip will prototype directive parsing support. Annita and Yuo (??) to handle coordination on syntax.

I suppose it's Annita and Fangrui

Side note to Annita: For you to remove "hard code", you'll have to have a placeholder for the enable/disable interface. That should probably be split and rebased in my patch.

Let's do it in your directive patch.

skan added a comment.Dec 17 2019, 12:24 AM

Noting another issue we found in local testing (with an older version of this patch). This interacts badly with the implicit exception mechanism in LLVM. For that mechanism, we end up generating assembly which looks more or less like this:
Ltmp:

cmp %rsi, (%rdi)
jcc <target>

And a side table which maps TLmp to another label so that a fault at Ltmp can be interpreted as an extremely expensive branch via signal handler.

The problem is that the auto-alignment of the fused branch causes padding to be introduced which separate the label and the faulting exception, breaking the mapping.

Essentially, this comes down to an implicit assumption that the label stays tightly bundled with the following instruction.

This can happen with either nop or prefix padding.

How about insert NOP before the label Ltmp?

There is a precedant: .pushsection/.popsection (MCStreamer::SectionStack). With .push_align_branch/.pop_align_branch, we probably don't need the 'switch-to-default' action.

I don't know how likely we may ever need nested states (e.g. an .include directive inside an .align_branch region where the included file has own idea about branch alignment), but .push/.pop does not seem to be more complex than disable/enable/default.

I rethink about the directives and prefer the .push/.pop pair as @MaskRay suggested. To be specified, I'd suggest to use .push_align_branch_boundary and .pop_align_branch_boundary to align with MC command line options. They will cowork with the command line options and overwrite the options if both are existing.

To be clarified, I described the behavior of the directives from my understanding. Feel free to speak if you have difference opinion.

.push_align_branch_boundary [N,] [instruction,]*

This directive specifies the beginning of a region which will overwrite the value set by the command line or by the previous directive. It can represent either an enabling or disabling directive controlled by parameter N. 
N indicates to align the branches within N byte boundary. The default value is 32. If N is 0, it means the branch alignment is off within this region. 
Instruction specifies types of branches to align. The value is one or multiple values from fused, jcc, jmp, call, ret and indirect. The default value is fused, jcc and jmp. (may change later)

.pop_align_branch_boundary

This directive specifies the end of a region to align branch boundary. The status will be back to which was set by the previous directive or the one set by the command line if there's no previous directive existing.

I will coordinate with GNU binutils community once we discuss and have agreement with the directives.

.push_align_branch_boundary [N,] [instruction,]*

I'd like to raise again the possibility of using a more general region directive to denote "It is allowable to add prefixes/nops before instructions in this region if the assembler wants to", as I'd started discussing in https://reviews.llvm.org/D71238#1786885 (but let's move the discussion here).

Whether this is OK or not on a particular piece of assembly-code is likely to be a generic property of the code, regardless of the purpose of the optimization. If we're going to have multiple assembler optimizations that can make use of this, it would be nice to express the "OK to pad" "not OK to pad" property only once, rather than once for each kind of optimization which might make such modifications.

In particular, I'd like to look ahead towards the potential implementation of two other features:

  1. Allowing the assembler to prefix-pad instructions in order to avoid having to emit a NOP for p2align directives.
  2. Allowing the assembler to do other instruction-padding performance optimizations to avoid other DSB cacheline limits.

To be concrete, I propose:
".autopad", ".noautopad": allow/disallow the assembler to emit padding via inserting a nop or prefix before any instruction, as needed.
".align_branch_boundary [N,] [instruction,]": Enable branch-boundary padding (per previous description).

In this scheme, I'd generally expect an ".align_branch_boundary" directive to be specified once at the beginning of the file, and ".autopad"/".noautopad" directives to be sprinkled throughout the file as required.

skan updated this revision to Diff 234312.Dec 17 2019, 8:58 AM
skan retitled this revision from Align branches within 32-Byte boundary to Align branches within 32-Byte boundary(NOP padding).
skan edited the summary of this revision. (Show Details)

Simplify

  1. Drop prefix padding support
  2. Drop clang driver support
  3. Drop default align option -x86-branches-within-32B-boundaries
  4. Drop hardcode support

Other

  1. Throw an error if an illegal value is passed to option -x86-align-branch
  2. Use llvm::Align instead of uint64_t to store the information about boundary
  3. Remove test cases for prefix padding and add more test cases for NOP padding
  4. Rename MCMachineDependentFragment to MCBoundaryAlignFragment

There is a precedant: .pushsection/.popsection (MCStreamer::SectionStack). With .push_align_branch/.pop_align_branch, we probably don't need the 'switch-to-default' action.

I don't know how likely we may ever need nested states (e.g. an .include directive inside an .align_branch region where the included file has own idea about branch alignment), but .push/.pop does not seem to be more complex than disable/enable/default.

I rethink about the directives and prefer the .push/.pop pair as @MaskRay suggested. To be specified, I'd suggest to use .push_align_branch_boundary and .pop_align_branch_boundary to align with MC command line options. They will cowork with the command line options and overwrite the options if both are existing.

I agree that we need the push/pop semantics.

To be clarified, I described the behavior of the directives from my understanding. Feel free to speak if you have difference opinion.

.push_align_branch_boundary [N,] [instruction,]*

This directive specifies the beginning of a region which will overwrite the value set by the command line or by the previous directive. It can represent either an enabling or disabling directive controlled by parameter N. 
N indicates to align the branches within N byte boundary. The default value is 32. If N is 0, it means the branch alignment is off within this region. 
Instruction specifies types of branches to align. The value is one or multiple values from fused, jcc, jmp, call, ret and indirect. The default value is fused, jcc and jmp. (may change later)

I'd remove the defaults. Let's just be explicit about what is being enabled/disabled.

.push_align_branch_boundary [N,] [instruction,]*

I'd like to raise again the possibility of using a more general region directive to denote "It is allowable to add prefixes/nops before instructions in this region if the assembler wants to", as I'd started discussing in https://reviews.llvm.org/D71238#1786885 (but let's move the discussion here).

James, I think this proposal is increasing the scope of this proposal too much. It also ignores some of the use cases identified and described in the writeup (i.e. the scoped semantics). I'm open to discussing such a feature more generally, but I'd prefer to see a more narrowly focused feature immediately.

Specifically on the revised patch, I remain confused by the need for multiple subtypes. The need for fragments *between* the potentially fused instructions doesn't make sense to me. What I was expecting to see was the following:
BoundaryAlign w/target=the branch fragment
.. some possibly empty sequence of fragments (i.e. the test/cmp/etc..) ...
the branch fragment
a new data fragment if the branch fragment was a DF

(i.e. a single BounaryAlign fragment which aligns a payload which is defined as "next fragment to target fragment inclusive".)

To be specific, I'd expect to see the following for an example fused sequence:

  1. BoundaryAlign w/Target = 3
  2. DataFragment containing TEST RAX, RAX
  3. RelaxeableFragment containing JNE symbo

Why do we need anything between the two fragments of the fused pair?

(As a reminder, I am new to this code. If I'm missing the obvious, please just point it out.)

.push_align_branch_boundary [N,] [instruction,]*

I'd like to raise again the possibility of using a more general region directive to denote "It is allowable to add prefixes/nops before instructions in this region if the assembler wants to", as I'd started discussing in https://reviews.llvm.org/D71238#1786885 (but let's move the discussion here).

James, I think this proposal is increasing the scope of this proposal too much. It also ignores some of the use cases identified and described in the writeup (i.e. the scoped semantics). I'm open to discussing such a feature more generally, but I'd prefer to see a more narrowly focused feature immediately.

I do not intend that we expand the scope of the project to include any of the other features.

All I want is to slightly consider surrounding features when adding the new assembly syntax. The situations where we want to avoid modifying a certain block of code are extremely likely to apply to any nop-or-prefix-introducing code modifications -- not just modifications resulting from branch alignment. So if we can make the directives annotating where such changes are allowable (and conversely, where they are not) generally-applicable, with a very minimal amount of work, that would be nice.

I also don't understand what you mean by "it ignores [...] scoped semantics"?

I've updated the draft assembler support in https://reviews.llvm.org/D71315 to match the proposal here. Again, this is very much WIP and mostly just to show proposed syntax/usage.

skan added a comment.EditedDec 17 2019, 9:25 PM

Specifically on the revised patch, I remain confused by the need for multiple subtypes. The need for fragments *between* the potentially fused instructions doesn't make sense to me. What I was expecting to see was the following:
BoundaryAlign w/target=the branch fragment
.. some possibly empty sequence of fragments (i.e. the test/cmp/etc..) ...
the branch fragment
a new data fragment if the branch fragment was a DF

(i.e. a single BounaryAlign fragment which aligns a payload which is defined as "next fragment to target fragment inclusive".)

To be specific, I'd expect to see the following for an example fused sequence:

  1. BoundaryAlign w/Target = 3
  2. DataFragment containing TEST RAX, RAX
  3. RelaxeableFragment containing JNE symbo

Why do we need anything between the two fragments of the fused pair?

(As a reminder, I am new to this code. If I'm missing the obvious, please just point it out.)

JUMP is not always emiteed into MCRelaxableFragment, it also can be emitted into MCDataFragment and arithmetic ops with constant arguments of unknown value (e.g. ADD,AND) can be emitted into MCRelaxableFragment , you can find related code in MCObjectStreamer::EmitInstructionImpl, X86AsmBackend::mayNeedRelaxation. Let's say JCC is fused with TEST, there are four possible positions for JCC and CMP

  1. JCC and CMP are in same MCDataFragment
  2. JCC and CMP are in two different MCDataFragment
  3. JCC and CMP are in two different MCRelaxableFragment
  4. JCC in a MCRelaxableFragment, CMP is in a MCDataFragment

and since MCCompactEncodedInstFragment is not applicable yet, i don't what's its behaviour.

In order to compute the total size of CMP and JCC in MCAssembler::relaxBoundaryAlign, I insert a FusedJccSplit to force CMP and JCC in two fragments. Do you have any better idea?

skan updated this revision to Diff 234488.Dec 18 2019, 2:01 AM
skan edited the summary of this revision. (Show Details)

Simplify

Drop the subtype of MCBoundaryAlignFragment and add data member EmitNops to indicate whether NOPs should be emitted.

skan updated this revision to Diff 234515.Dec 18 2019, 5:00 AM
  1. rename MCBoundaryAlignFragment::hasEmitNop() to MCBoundaryAlignFragment::canEmitNop()
  1. reduce the number of MCBoundaryAlignFragment emitted as possible
skan updated this revision to Diff 234520.Dec 18 2019, 5:20 AM

Fix a typo in MCFragment::dump()

skan updated this revision to Diff 234650.Dec 18 2019, 7:38 PM

move the code that checks if we can reuse the current MCBoundaryAlignFragment into the function X86AsmBackend::getOrCreateBoundaryAlignFragment

skan updated this revision to Diff 234697.Dec 19 2019, 4:08 AM

Add more comment

Specifically on the revised patch, I remain confused by the need for multiple subtypes. The need for fragments *between* the potentially fused instructions doesn't make sense to me. What I was expecting to see was the following:
BoundaryAlign w/target=the branch fragment
.. some possibly empty sequence of fragments (i.e. the test/cmp/etc..) ...
the branch fragment
a new data fragment if the branch fragment was a DF

(i.e. a single BounaryAlign fragment which aligns a payload which is defined as "next fragment to target fragment inclusive".)

To be specific, I'd expect to see the following for an example fused sequence:

  1. BoundaryAlign w/Target = 3
  2. DataFragment containing TEST RAX, RAX
  3. RelaxeableFragment containing JNE symbo

Why do we need anything between the two fragments of the fused pair?

(As a reminder, I am new to this code. If I'm missing the obvious, please just point it out.)

JUMP is not always emiteed into MCRelaxableFragment, it also can be emitted into MCDataFragment and arithmetic ops with constant arguments of unknown value (e.g. ADD,AND) can be emitted into MCRelaxableFragment , you can find related code in MCObjectStreamer::EmitInstructionImpl, X86AsmBackend::mayNeedRelaxation. Let's say JCC is fused with TEST, there are four possible positions for JCC and CMP

  1. JCC and CMP are in same MCDataFragment
  2. JCC and CMP are in two different MCDataFragment
  3. JCC and CMP are in two different MCRelaxableFragment
  4. JCC in a MCRelaxableFragment, CMP is in a MCDataFragment

and since MCCompactEncodedInstFragment is not applicable yet, i don't what's its behaviour.

In order to compute the total size of CMP and JCC in MCAssembler::relaxBoundaryAlign, I insert a FusedJccSplit to force CMP and JCC in two fragments. Do you have any better idea?

I agree there are multiple cases here, see the original generic description instead of the specific example. The general question is why a *range* of fragments can't be defined. Computing the instruction size for the entire range then just requires walking from first to last fragment in the range summing the size of each. If both instructions are within the same data fragment, then no relaxation is needed, and the size of both is simply the size of the data fragment.

Unless I'm missing something here?

reames accepted this revision.Dec 19 2019, 2:51 PM

LGTM. The patch isn't perfect, but this has reached the point where landing and iterating in tree is the fastest way to convergence. To be clear, this LGTM *only* applies to the current patch contents, and the *internal only* flag names this introduces. These may change before we expose this publicly.

Warts with the current patch we should iterate to address:

  • Stylistic issues w.r.t. comments, naming, static vs member functions remain. None are show stoppers.
  • Many of the tests can probably be simplified and condensed.
  • The bundling scheme used is probably more complicated than needed (see previous suggestions).
  • This patch doesn't include anyway to locally disable padding, and thus is *known incorrect* in some cases. As this remains off by default, this is not a blocker for commit.

p.s. I spoke w/James this morning, and we came up with some revisions in approach he's going to suggest separately. We did explicitly discuss the status of the current patch, and whether it needed further review cycles or could be iterated in tree. Normally, I'd wait for him to summarize that conversation publicly, but given the time delay and vacation schedules involved, I want to avoid loosing another day cycle.

llvm/include/llvm/MC/MCFragment.h
569

Please add a note that the size is lazily set during relaxation, and is not meaningful before that.

llvm/test/MC/X86/align-branch-32-1a.s
35

For testing alignment functionality, we can probably use a repl int3 pattern here. It would make the tests much more concise, and shouldn't effect the logic being (currently) tested.

To be concrete, I propose:
".autopad", ".noautopad": allow/disallow the assembler to emit padding via inserting a nop or prefix before any instruction, as needed.
".align_branch_boundary [N,] [instruction,]": Enable branch-boundary padding (per previous description).

I had thought I sent the comments yesterday, but I didn't. :( I think my comments are aligned with the conclusion that Philip and Jame got. Wait for Jame's summary.

In general, I think it's a good idea to have generic directives to control the padding behavior in assembler. ".autopad", ".noautopad", ".align_branch_boundary" looks good except it can't handle the nested scenarios. I can imagine the nested cases in assembly file which includes another assembly file. If we want to handle it, we need to have a pair of directives for each above. It will make the semantics complicated. We need elaborately design it.
I'd echo what Philip said. We want a more focused and basic implementation here. We're very open to have more discussion on generic directives. However, I'd prefer it's a separate topic from this one.

skan added a comment.Dec 19 2019, 6:16 PM

The general question is why a *range* of fragments can't be defined. Computing the instruction size for the entire range then just requires walking from first to last fragment in the range summing the size of each. If both instructions are within the same data fragment, then no relaxation is needed, and the size of both is simply the size of the data fragment.

Unless I'm missing something here?

Thank you for your detailed explanation, the solution has been consistent with your suggestion now.

MaskRay added inline comments.Dec 19 2019, 9:40 PM
llvm/lib/MC/MCAssembler.cpp
1004

be relaxed

1008

uint64_t

1010

auto -> MCFragment

1014

unsigned -> int

1018

uint64_t

skan marked an inline comment as done.Dec 19 2019, 10:59 PM
skan added inline comments.
llvm/lib/MC/MCAssembler.cpp
1014

Why use int here?

skan updated this revision to Diff 234830.Dec 19 2019, 11:42 PM
  1. Simplify the test cases.
  2. Add some comments
skan marked 6 inline comments as done.Dec 19 2019, 11:44 PM
MaskRay added inline comments.Dec 20 2019, 12:04 AM
llvm/lib/MC/MCAssembler.cpp
1014

The loop variable can only be 0 or 1. 0U 1U 2U the unsigned suffix are just redundant. int suffices. It also improves readability a bit by avoiding auto.

skan added a comment.Dec 20 2019, 12:29 AM

Do you think this patch is ready to land? @MaskRay

Do you think this patch is ready to land? @MaskRay

It is 00:35 here and I should confess that I haven't read through this yet.

There are some minor things like (1) pervasive auto (2) report_fatal_error in X86AlignBranchKind is not suitable. I expect the error is reported at a command line parsing stage. (3) I don't see how __tls_get_addr is referenced in code but somehow it magically appears in tests. It may be related to VK_NONE but there should be more tests for !VK_NONE cases. (4) More function-level comments are needed.

If it is not super urgent, probably want a couple of hours to get a response from @reames or @jyknight whether they think this can be committed as is, with iteration work in the future?

(Also note that @fedor.sergeev has requested changes. The convention is to wait for @fedor.sergeev to agree that this can be accepted... but there are vacation schedules involved.)

llvm/lib/MC/MCAssembler.cpp
1020

Align

skan updated this revision to Diff 234846.Dec 20 2019, 2:41 AM
  1. Add more tests for !VK_NONE cases.
  2. Reduce pervasive auto
skan marked an inline comment as done.Dec 20 2019, 2:42 AM
This revision was not accepted when it landed; it landed in state Needs Review.Dec 20 2019, 11:41 AM
This revision was automatically updated to reflect the committed changes.

I've gone ahead and landed the patch so that we can iterate in tree. See commit 14fc20ca62821b5f85582bf76a467d412248c248.

I've also landed a couple of follow up patches to address issues which would have otherwise required iteration on the review. See commits c148e2e2ef86f53391be459752511684424f331b, 4024d49edc1598a6f8017df541147b38bf1e2818, and 8b725f0459eee468ed7f9935fba3278fcb4997b1.

I still see some room for further cleanup (i.e. the fragment range scheme and tests), but what's in is of reasonable quality.

There's a couple follow up patches which are probably called for, but I think we can work on these in parallel now.

  1. We need to settle on assembler syntax.
  2. We need a patch for the x86 MI to MC translation to mark regions unsafe to pad. (Probably best to separate from the above for the moment.)
  3. We can incrementally add support for prefix padding.
MaskRay added a comment.EditedDec 20 2019, 5:00 PM

I've gone ahead and landed the patch so that we can iterate in tree. See commit 14fc20ca62821b5f85582bf76a467d412248c248.

I've also landed a couple of follow up patches to address issues which would have otherwise required iteration on the review. See commits c148e2e2ef86f53391be459752511684424f331b, 4024d49edc1598a6f8017df541147b38bf1e2818, and 8b725f0459eee468ed7f9935fba3278fcb4997b1.

I still see some room for further cleanup (i.e. the fragment range scheme and tests), but what's in is of reasonable quality.

There's a couple follow up patches which are probably called for, but I think we can work on these in parallel now.

  1. We need to settle on assembler syntax.
  2. We need a patch for the x86 MI to MC translation to mark regions unsafe to pad. (Probably best to separate from the above for the moment.)
  3. We can incrementally add support for prefix padding.

If you planned to clean up, you could have made the cleanups in the original land:) Though there would be a problem of proper contribution attribution. After Subversion->Git transition, we can actually run `git commit --amend --author=ask-author-to-provide-name-and- <email>' (Though changing the author may not be fair to your contribution to this commit... oh it is so difficult.) Anyway, it is nice to see this feature before Christmas and people can start investigating its impact now.

I've gone ahead and landed the patch so that we can iterate in tree. See commit 14fc20ca62821b5f85582bf76a467d412248c248.

I've also landed a couple of follow up patches to address issues which would have otherwise required iteration on the review. See commits c148e2e2ef86f53391be459752511684424f331b, 4024d49edc1598a6f8017df541147b38bf1e2818, and 8b725f0459eee468ed7f9935fba3278fcb4997b1.

I still see some room for further cleanup (i.e. the fragment range scheme and tests), but what's in is of reasonable quality.

There's a couple follow up patches which are probably called for, but I think we can work on these in parallel now.

  1. We need to settle on assembler syntax.
  2. We need a patch for the x86 MI to MC translation to mark regions unsafe to pad. (Probably best to separate from the above for the moment.)
  3. We can incrementally add support for prefix padding.

Thanks for landing it! And happy holiday to all!

MTC added a subscriber: MTC.Apr 21 2021, 11:52 PM