
Align branches within 32-Byte boundary
Needs ReviewPublic

Authored by skan on Nov 12 2019, 6:50 PM.

Details

Summary

The microcode update for the Jump Conditional Code Erratum may cause performance
loss for some workloads:

https://www.intel.com/content/www/us/en/support/articles/000055650.html

This patch mitigates the performance impact by aligning branches
within a 32-byte boundary. The affected instructions are:

a. Conditional jump.
b. Fused conditional jump.
c. Unconditional jump.
d. Indirect jump.
e. Ret.
f. Call.

An option, -mbranches-within-32B-boundaries, is added to align branches within a
32-byte boundary and thereby reduce the potential performance loss from the
microcode update. The option is equivalent to the combination of three options:

-malign-branch-boundary=32
-malign-branch=fused+jcc+jmp
-malign-branch-prefix-size=5

For llvm-mc, -x86-branches-within-32B-boundaries is added; it enables
-x86-align-branch-boundary=32
-x86-align-branch=fused+jcc+jmp
-x86-align-branch-prefix-size=5

Finer-grained options are also added for clang:

  1. -malign-branch-boundary=NUM aligns branches within a NUM-byte boundary.
  2. -malign-branch=TYPE[+TYPE...] specifies the types of branches to align.
  3. -malign-branch-prefix-size=NUM limits the prefix padding to NUM bytes per instruction.

The corresponding options for llvm-mc are -x86-align-branch-boundary=NUM,
-x86-align-branch=TYPE[+TYPE...], and -x86-align-branch-prefix-size=NUM.
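For example, assuming a build of llvm-mc that includes this patch, the combined behavior could be requested explicitly with the three fine-grained flags listed above:

```
# Align branches within 32-byte boundaries while assembling foo.s
# (equivalent to -x86-branches-within-32B-boundaries)
llvm-mc -triple=x86_64 -filetype=obj \
    -x86-align-branch-boundary=32 \
    -x86-align-branch=fused+jcc+jmp \
    -x86-align-branch-prefix-size=5 \
    foo.s -o foo.o
```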

A new MCFragment type, MCMachineDependentFragment, is added, which has
6 subtypes:

  1. BranchPadding: a variable-size fragment that inserts NOPs before a branch.
  2. BranchPrefix: a variable-size fragment that inserts segment prefixes before an instruction. The prefix is chosen as follows: (a) use the instruction's existing segment prefix if it has one; (b) in 64-bit mode, use the CS segment prefix; (c) in 32-bit mode, use the SS segment prefix when the base register is ESP/EBP, and the DS segment prefix otherwise.
  3. FusedJccPadding: a variable-size fragment that inserts NOPs before a fused conditional jump.
  4. BranchSplit: a zero-size fragment that separates the instruction that is fused with the following conditional jump from that fused jcc.
  5. HardCodeBegin: a zero-size fragment that marks the beginning of a sequence of hard-coded bytes.
  6. HardCodeEnd: a zero-size fragment that marks the end of a sequence of hard-coded bytes.
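As a rough sketch (the enumerator names come from the summary above, but the layout is assumed, not the patch's actual code), the six subtypes and their sizing behavior could be modeled as:

```cpp
#include <cassert>

// Hypothetical sketch of the six MCMachineDependentFragment subtypes.
enum MachineDependentFragmentKind {
  BranchPadding,   // variable size: NOPs inserted before a branch
  BranchPrefix,    // variable size: segment prefixes added to an instruction
  FusedJccPadding, // variable size: NOPs before a fused conditional jump
  BranchSplit,     // zero size: separates a fused instruction from its jcc
  HardCodeBegin,   // zero size: marks the start of hard-coded bytes
  HardCodeEnd      // zero size: marks the end of hard-coded bytes
};

// Only the padding/prefix subtypes can change size during relaxation;
// the other three are zero-size markers.
bool isVariableSize(MachineDependentFragmentKind K) {
  return K == BranchPadding || K == BranchPrefix || K == FusedJccPadding;
}
```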

alignBranchesBegin and alignBranchesEnd are used to insert an
MCMachineDependentFragment before instructions, and relaxMachineDependent
grows or shrinks the prefix and NOP sizes to align the next branch fragment:

  1. First, we try to add segment prefixes to the instructions before a branch.
  2. If there is not enough room to add segment prefixes, NOPs are inserted before the branch.
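The underlying padding decision can be sketched as follows. This is a simplified, hypothetical model (the function names and exact logic are assumptions, not the patch's code): per the erratum, a jump is affected when it crosses a 32-byte boundary or ends exactly on one, so the assembler must compute the smallest number of bytes to insert before the branch to avoid that.

```cpp
#include <cassert>
#include <cstdint>

// True if an instruction at StartAddr of Size bytes crosses a boundary of
// Boundary bytes, or ends exactly on one (the cases the erratum penalizes).
bool mayCrossBoundary(uint64_t StartAddr, uint64_t Size, uint64_t Boundary) {
  uint64_t EndAddr = StartAddr + Size;
  return (StartAddr / Boundary) != ((EndAddr - 1) / Boundary) ||
         EndAddr % Boundary == 0;
}

// Smallest number of padding bytes (prefixes or NOPs) to place before the
// branch so that it no longer crosses or ends on a boundary.
uint64_t bytesToPad(uint64_t StartAddr, uint64_t Size, uint64_t Boundary) {
  uint64_t Pad = 0;
  while (mayCrossBoundary(StartAddr + Pad, Size, Boundary))
    ++Pad;
  return Pad;
}
```

For instance, a 6-byte je at byte 90 ends at byte 96, exactly on a 32-byte boundary, so 6 bytes of padding push it to start at byte 96 instead.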

Prefix and NOP padding are disabled in two cases:

  1. If the previous item is hard code, which may be used to hand-encode an instruction, since there is then no clear instruction boundary.
  2. If the instruction may be rewritten by the linker, such as a TLS call.

Diff Detail

Event Timeline

skan added a comment.Wed, Nov 20, 11:35 PM

Do not insert a NOP or prefix if the previous item is hard code, which may be
used to hand-encode an instruction, since there is no clear instruction boundary.

Thanks for the comments, they help a little. But it's still somewhat confusing, so let me write down what seems to be happening:

  • Before emitting every instruction, a new MCMachineDependentFragment is now emitted, of one of the multiple types:
    • For most instructions, that'll be BranchPrefix.
    • For things that need branch-alignment, it'll be BranchPadding, unless there's a fused conditional before, in which case it's BranchSplit
    • For fused conditionals, it'll be FusedJccPadding.
  • After emitting an instruction that needs branch-alignment, all of those previously-emitted MCMachineDependentFragment are updated to point to the branch's fragment.
  • Thus, every MCDataFragment now only contains a single instruction (this property is depended upon for getInstSize, at least).

All the MCMachineDependentFragments in a region bounded by a branch at the end and either a branch or a fragment-type which is not type in {FT_Data, FT_MachineDependent, FT_Relaxable, FT_CompactEncodedInst} at the beginning, will reference the ending branch instruction's fragment.

Then, when it comes time to do relaxation, every one of those machine-dependent-fragments has the opportunity to grow its instruction a little bit. The first instruction in a "block" will grow up to 5 segment prefixes (via modifying the BranchPrefix fragment), and then if more is needed, more prefixes will be added to the next instruction, and so on. Until you run out of instructions in the region. At which point the BranchPadding or FusedJccPadding types (right before the branch/fused-branch) will be able to emit nops to achieve the desired alignment.

An alternative would be to simply emit NOPs before branches as needed. That would be substantially simpler, since it would only require special handling for a branch or a fused-branch. I assume things were done this substantially-more-complex way in order to reduce performance cost of inserting NOP instructions? Are there numbers for how much better it is to use segment prefixes, vs a separate nop instruction? It seems a little bit surprising to me that it would be that important, but I don't know...

I'll note that the method here has the semantic issue of making it effectively impossible to ever evaluate an expression like ".if . - symbol == 24" (assuming we're emitting instructions), since every instruction can now change size. I suspect that will make it impossible to turn this on by default without breaking a lot of assembly code. Previously, only certain instructions, like branches or arithmetic ops with constant arguments of unknown value, could change size.
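The grow-prefixes-first, then-NOPs policy described above can be sketched as a greedy allocation. This is a hypothetical simplification, not the patch's actual relaxation code: given the padding a branch needs and how many extra segment prefixes each preceding instruction in the region can still accept (at most 5 per instruction), assign prefixes first and emit whatever remains as NOPs.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Returns the per-instruction prefix counts and the leftover byte count,
// which would be emitted as NOPs immediately before the branch.
std::pair<std::vector<unsigned>, unsigned>
allocatePadding(unsigned Needed, const std::vector<unsigned> &MaxPrefixes) {
  std::vector<unsigned> Prefixes(MaxPrefixes.size(), 0);
  for (std::size_t I = 0; I < MaxPrefixes.size() && Needed > 0; ++I) {
    // Greedily take as many prefixes as this instruction still allows.
    unsigned Take = std::min(Needed, MaxPrefixes[I]);
    Prefixes[I] = Take;
    Needed -= Take;
  }
  return {Prefixes, Needed}; // leftover becomes NOP padding
}
```

For example, 7 bytes of required padding across two instructions that can each take 5 prefixes yields 5 + 2 prefixes and zero NOPs; 12 bytes across capacities of 5 and 3 yields 5 + 3 prefixes and 4 NOP bytes.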

llvm/lib/MC/MCAssembler.cpp
1030

I don't think this is necessary. AFAICT, the symbols should already be in the right place -- pointing to the relax fragment, not the instruction itself, without this. And removing all this moveSymbol/updateSymbolMap code doesn't make any tests fail.

Thanks for the comments, they help a little. But it's still somewhat confusing, so let me write down what seems to be happening:

  • Before emitting every instruction, a new MCMachineDependentFragment is now emitted, of one of the multiple types:
    • For most instructions, that'll be BranchPrefix.
    • For things that need branch-alignment, it'll be BranchPadding, unless there's a fused conditional before, in which case it's BranchSplit
    • For fused conditionals, it'll be FusedJccPadding.
  • After emitting an instruction that needs branch-alignment, all of those previously-emitted MCMachineDependentFragment are updated to point to the branch's fragment.
  • Thus, every MCDataFragment now only contains a single instruction (this property is depended upon for getInstSize, at least).

    All the MCMachineDependentFragments in a region bounded by a branch at the end and either a branch or a fragment-type which is not type in {FT_Data, FT_MachineDependent, FT_Relaxable, FT_CompactEncodedInst} at the beginning, will reference the ending branch instruction's fragment.

    Then, when it comes time to do relaxation, every one of those machine-dependent-fragments has the opportunity to grow its instruction a little bit. The first instruction in a "block" will grow up to 5 segment prefixes (via modifying the BranchPrefix fragment), and then if more is needed, more prefixes will be added to the next instruction, and so on. Until you run out of instructions in the region. At which point the BranchPadding or FusedJccPadding types (right before the branch/fused-branch) will be able to emit nops to achieve the desired alignment.

    An alternative would be to simply emit NOPs before branches as needed. That would be substantially simpler, since it would only require special handling for a branch or a fused-branch. I assume things were done this substantially-more-complex way in order to reduce performance cost of inserting NOP instructions? Are there numbers for how much better it is to use segment prefixes, vs a separate nop instruction? It seems a little bit surprising to me that it would be that important, but I don't know...

I don't have any numbers myself. I was only involved in some of the code review internally. My understanding is that NOP instructions would place extra nop uops into the DSB(the decoded uop buffer) and that limits the performance that can be recovered. By using redundant prefixes no extra uops are generated and more performance is recovered.

I'll note that the method here has the semantic issue of making it effectively impossible to ever evaluate an expression like ".if . - symbol == 24" (assuming we're emitting instructions), since every instruction can now change size. I suspect that will make it impossible to turn this on by default without breaking a lot of assembly code. Previously, only certain instructions, like branches or arithmetic ops with constant arguments of unknown value, could change size.

efriedma added inline comments.
llvm/include/llvm/MC/MCFragment.h
591

Global variables are forbidden in LLVM libraries; there could be multiple LLVMContexts in the same process.

skan marked an inline comment as done.Thu, Nov 21, 8:26 PM

Thanks for the comments, they help a little. But it's still somewhat confusing, so let me write down what seems to be happening:

  • Before emitting every instruction, a new MCMachineDependentFragment is now emitted, of one of the multiple types:
    • For most instructions, that'll be BranchPrefix.
    • For things that need branch-alignment, it'll be BranchPadding, unless there's a fused conditional before, in which case it's BranchSplit
    • For fused conditionals, it'll be FusedJccPadding.
  • After emitting an instruction that needs branch-alignment, all of those previously-emitted MCMachineDependentFragment are updated to point to the branch's fragment.
  • Thus, every MCDataFragment now only contains a single instruction (this property is depended upon for getInstSize, at least).

    All the MCMachineDependentFragments in a region bounded by a branch at the end and either a branch or a fragment-type which is not type in {FT_Data, FT_MachineDependent, FT_Relaxable, FT_CompactEncodedInst} at the beginning, will reference the ending branch instruction's fragment.

    Then, when it comes time to do relaxation, every one of those machine-dependent-fragments has the opportunity to grow its instruction a little bit. The first instruction in a "block" will grow up to 5 segment prefixes (via modifying the BranchPrefix fragment), and then if more is needed, more prefixes will be added to the next instruction, and so on. Until you run out of instructions in the region. At which point the BranchPadding or FusedJccPadding types (right before the branch/fused-branch) will be able to emit nops to achieve the desired alignment.

    An alternative would be to simply emit NOPs before branches as needed. That would be substantially simpler, since it would only require special handling for a branch or a fused-branch. I assume things were done this substantially-more-complex way in order to reduce performance cost of inserting NOP instructions? Are there numbers for how much better it is to use segment prefixes, vs a separate nop instruction? It seems a little bit surprising to me that it would be that important, but I don't know...

    I'll note that the method here has the semantic issue of making it effectively impossible to ever evaluate an expression like ".if . - symbol == 24" (assuming we're emitting instructions), since every instruction can now change size. I suspect that will make it impossible to turn this on by default without breaking a lot of assembly code. Previously, only certain instructions, like branches or arithmetic ops with constant arguments of unknown value, could change size.

Thanks for your detailed and accurate analysis of my code! I am sorry; this summary should have been written by me.

llvm/lib/MC/MCAssembler.cpp
1030

Yes, I checked, and you are right. I will remove all of the moveSymbol/updateSymbolMap code.

skan updated this revision to Diff 230605.Fri, Nov 22, 1:12 AM
skan edited the summary of this revision. (Show Details)

Three changes are made:

  1. Remove the moveSymbol/updateSymbolMap code, since it is not necessary.
  2. Make the variables AlignBoundarySize and AlignMaxPrefixSize non-global.
  3. Disable NOP padding before instructions with a variant symbol operand, since they may be rewritten by the linker.
skan updated this revision to Diff 230617.EditedFri, Nov 22, 2:25 AM

Speed up the case where we only want to emit NOPs before branches as needed.

skan added a comment.Fri, Nov 22, 3:01 AM

An alternative would be to simply emit NOPs before branches as needed. That would be substantially simpler, since it would only require special handling for a branch or a fused-branch. I assume things were done this substantially-more-complex way in order to reduce performance cost of inserting NOP instructions? Are there numbers for how much better it is to use segment prefixes, vs a separate nop instruction? It seems a little bit surprising to me that it would be that important, but I don't know...

I'll note that the method here has the semantic issue of making it effectively impossible to ever evaluate an expression like ".if . - symbol == 24" (assuming we're emitting instructions), since every instruction can now change size. I suspect that will make it impossible to turn this on by default without breaking a lot of assembly code. Previously, only certain instructions, like branches or arithmetic ops with constant arguments of unknown value, could change size.

Thanks for the reminder! Now, if -malign-branch-prefix-size=0 is used, the method only inserts BranchPadding or FusedJccPadding before branches as needed, and inserts BranchPrefix before instructions that are macro-fusible but not macro-fused. In that case, the operation is as simple as inserting NOPs only. Since more performance can be recovered by inserting prefixes than by inserting NOPs, I believe I should keep supporting prefix padding.

skan edited the summary of this revision. (Show Details)Fri, Nov 22, 3:48 AM

(Just a reminder that we need to have both performance and code size numbers for this patch. And given that there are a few options, may need a few examples.)

I have some other concerns about the code itself, but after pondering this a little bit, I'd like to instead bring up some rather more general concerns about the overall approach used here -- with some suggestions. (As these comments are not really comments on the _code_, but on the overall strategy, they also apply to the gnu binutils patch for this same feature.)

Of course, echoing chandler, it would be nice to see some performance numbers, otherwise it's not clear how useful any of this is.

Segment-prefix padding

The ability to pad instructions instead of adding a multibyte-NOP in order to create alignment seems like a generally-useful feature, which should be usable in other situations where we align within a function -- assuming that there is indeed a measurable performance benefit vs NOP instructions. E.g. we should do this for basic-block alignment, as well! As such, it feels like this feature ought to be split out, and implemented separately from the new branch-alignment functionality -- in a way which is usable for any kind of alignment request.

The way I'd imagine it working is to introduce a new pair of asm directives to enable and disable segment-prefix padding over a range of instructions (let's say ".enable_instr_prefix_pad" and ".disable_instr_prefix_pad"). Within such an enabled range, the assembler would prefix instructions as required in order to satisfy later nominally-NOP-emitting code alignment directives (such as the usual '.p2align 4, 0x90').

Branch alignment

The primary goal of this patch, restricting the placement of branch instructions, is a performance optimization. Similar to loop alignment, the desire is to increase speed, at the cost of code-size. However, the way this feature has been designed is a global assembler flag. I find that not ideal, because it cannot take into account hotness of a block/function, as for example loop alignment code does. Basic-block alignment of loops is explicitly opt-in on an block-by-block basis -- the compiler simply emits a p2align directive where it needs, and the assembler honors that. And so, MachineBlockPlacement::alignBlocks has a bunch of conditions under which it will avoid emitting a p2align. This seems like a good model -- the assembler does what it's told by the compiler (or assembly-writer). Making the branch-instruction-alignment work similarly seems like it would be good.

IMO it would be nicest if there could be a directive that requests to specially-align the next instruction. However, the fused-jcc case makes that quite tricky, so perhaps this ought to also be a mode which can be enabled/disabled on a region as well.
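As a sketch, compiler output under the directive-based proposal above might look like the following (directive names taken from the suggestion; the exact semantics are illustrative only):

```
.enable_instr_prefix_pad        # assembler may pad these with segment prefixes
        movq    %rsp, %rbp
        testq   %rdi, %rdi
        je      .LBB0_2         # kept from crossing a 32-byte boundary
.disable_instr_prefix_pad       # disabled, e.g., around inline assembly
```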

Enabling by default

Previously, I'd mentioned that it seemed likely we couldn't actually enable branch-alignment by default because it'll probably break people's inline-asm and standalone asm files. That would be solved by making everything controllable within the asm file. The compiler could then insert the directives for its own code, and disable it around inline assembly. And standalone asm files could remain unaffected, unless they opt in. With that, we could actually enable the alignment by default, for compiled output in certain cpu-tuning modes, if it's warranted.

reames added a subscriber: reames.Mon, Dec 2, 9:02 AM
reames added a comment.Mon, Dec 2, 9:34 AM

I want to chime in to support jyknight's meta comments, particularly the one about the need to balance execution speed vs. code size differently in hot vs. cold code. For our use case, we have a very large amount of branch-dense, known-cold paths, and being able to align only fast-path branches would be a substantial space savings.

I also see value in having the prefix padding feature factored out generically. If that mechanism is truly measurably faster than multi-byte NOPs (which, if I am reading the comments correctly, has been claimed but not documented or measured), using it generically for other alignment purposes would likely be worthwhile.

I'd also like to see, probably in a separate patch, support for auto-detecting whether the host CPU needs this mitigation. Both -mcpu=native and various JITs will end up needing this, and having the code centralized in one place would be good.

Just FYI for now, as I'm trying to dig further...
I have been trying this fix in our downstream environment and managed to get a hang with this backtrace:

#0 needPadding (BoundarySize=<optimized out>, Size=<optimized out>, StartAddr=<optimized out>) at ./llvm/lib/MC/MCAssembler.cpp:1028
#1 llvm::MCAssembler::relaxMachineDependent (this=<optimized out>, Layout=..., MF=...) at ./llvm/lib/MC/MCAssembler.cpp:1077
#2 0x00007ffff1adc8b6 in llvm::MCAssembler::layoutSectionOnce (this=this@entry=0x7fff90580580, Layout=..., Sec=...) at ./orca/llvm/lib/MC/MCAssembler.cpp:1213
...

Working on getting upstream reproducer....

fedor.sergeev requested changes to this revision.EditedTue, Dec 3, 2:42 PM

Working on getting upstream reproducer....

Hangs on a moderately small piece (~150 lines) of x86 assembly, filed a bug on it here:

https://bugs.llvm.org/show_bug.cgi?id=44215

This revision now requires changes to proceed.Tue, Dec 3, 2:42 PM

I'm seeing lots of updates to fix bugs, but no movement for many days on both my meta comments and (in some ways more importantly) James's meta comments. (And thanks Philip for chiming in too!)

Meanwhile, we really, really need to get this functionality in place. The entire story for minimizing the new microcode performance hit hinges on these patches, and I'm really worried by how little progress we're seeing here.

craig.topper added inline comments.Tue, Dec 3, 10:23 PM
llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h
158 ↗(On Diff #232032)

None of the AND/ADD/SUB instructions ending in mr are eligible for macrofusion as far as I know. Those all involve a load and a store which is not supported by macrofusion.

We also lost all the ADD*_DB instructions from the macrofusion list. I believe they are in the existing list incorrectly. So removing them is correct, but as far as I can see that change was not mentioned in the description of this patch.

Can we split the macrofusion refactoring out of this patch so we can review it separately, and hopefully get it committed while the other review feedback is being addressed?

annita.zhang added a comment.EditedTue, Dec 3, 11:29 PM

I'm seeing lots of updates to fix bugs, but no movement for many days on both my meta comments and (in some ways more importantly) James's meta comments. (And thanks Philip for chiming in too!)

Meanwhile, we really, really need to get this functionality in place. The entire story for minimizing the new microcode performance hit hinges on these patches, and I'm really worried by how little progress we're seeing here.

Sorry for the belated response. We're working hard to get through some paperwork so that the performance data is ready. I think it may be better to open a thread on llvm-dev to post the performance data and discuss those suggestions.

The first data was posted in http://lists.llvm.org/pipermail/llvm-dev/2019-December/137413.html.

Thanks,
Annita

skan updated this revision to Diff 232045.Wed, Dec 4, 12:17 AM

Fix the macro fusion table. If an instruction's first source operand or its destination operand is memory, it cannot be macro-fused.

skan marked an inline comment as done.Wed, Dec 4, 12:20 AM
skan added inline comments.
llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h
158 ↗(On Diff #232032)

Okay, I will upload another patch to correct the macrofusion table as soon as possible.

Can you please put the macro fusion changes in a separate phabricator review. I’ll review it in the morning US time and if it all looks good we can get that part committed while the other comments are being addressed.

skan added a comment.Wed, Dec 4, 12:47 AM

Can you please put the macro fusion changes in a separate phabricator review. I’ll review it in the morning US time and if it all looks good we can get that part committed while the other comments are being addressed.

Sure.

skan updated this revision to Diff 232068.Wed, Dec 4, 3:08 AM

We shouldn't try to insert a prefix or NOP in a virtual section.

skan updated this revision to Diff 232116.Wed, Dec 4, 6:33 AM

Since hard code exists only in text sections, we should not insert HardCodeBegin/HardCodeEnd in other sections, such as virtual sections.

spatel added a subscriber: spatel.Wed, Dec 4, 6:57 AM

I am still trying to understand the patch. Just made some comments about the tests.

llvm/include/llvm/MC/MCFragment.h
570

Don’t duplicate function or class name at the beginning of the comment (BranchPadding - ). (ref: https://llvm.org/docs/CodingStandards.html#doxygen-use-in-documentation-comments)

581

Full stop.

616

Move the llvm_unreachable below the switch, otherwise clang will give a warning:

warning: default label in switch which covers all enumeration values [-Wcovered-switch-default]

Unfortunately, GCC (even GCC 9) with -Wall will warn "warning: control reaches end of non-void function [-Wreturn-type]" unless you place an unreachable statement.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
546

Space after if

llvm/test/MC/X86/x86-64-align-branch-1a.s
1 ↗(On Diff #232116)

1a~1g use the same source file. Move the source to Inputs/align-branch-1-64.s

According to the local naming convention, this test should probably be renamed to align-branch-1-64.s

7 ↗(On Diff #232116)

Delete

11 ↗(On Diff #232116)

Delete Disassembly of section .text:. Ditto below.

llvm/test/MC/X86/x86-64-align-branch-1b.s
4 ↗(On Diff #232116)

Delete

10 ↗(On Diff #232116)

I think 1a.s and 1b.s should be merged. FileCheck supports

--check-prefixes=CHECK,PREFIX5
--check-prefixes=CHECK,PREFIX1

CHECK: common part
CHECK-NEXT: common part
PREFIX5:
PREFIX5-NEXT:
PREFIX1:
PREFIX1-NEXT:
CHECK:
CHECK-NEXT:
% diff -U1 x86-64-align-branch-1[ab].s
 # CHECK: 0000000000000000 foo:
-# CHECK-NEXT:        0: 64 64 64 64 89 04 25 01 00 00 00 movl    %eax, %fs:1
-# CHECK-NEXT:        b: 55                               pushq   %rbp
-# CHECK-NEXT:        c: 55                               pushq   %rbp
-# CHECK-NEXT:        d: 55                               pushq   %rbp
+# CHECK-NEXT:        0: 64 89 04 25 01 00 00 00          movl    %eax, %fs:1
+# CHECK-NEXT:        8: 2e 55                            pushq   %rbp
+# CHECK-NEXT:        a: 2e 55                            pushq   %rbp
+# CHECK-NEXT:        c: 2e 55                            pushq   %rbp
 # CHECK-NEXT:        e: 48 89 e5                         movq    %rsp, %rbp

Is there a performance benefit to adding 4 prefixes to the same instruction?

llvm/test/MC/X86/x86-64-align-branch-1c.s
2 ↗(On Diff #232116)

The difference between 1a and 1c is that 1c does not allow "jmp", but in 1a no jmp instructions get a prefix in the test, so it is unclear why 1c has different output.

llvm/test/MC/X86/x86-64-align-branch-1d.s
4 ↗(On Diff #232116)

Delete

llvm/test/MC/X86/x86-64-align-branch-1e.s
46 ↗(On Diff #232116)

This is weird. Comparing this with 1d, 1e allows more instruction types, yet it inserts two NOPs which actually seems to degrade performance.

llvm/test/MC/X86/x86-64-align-branch-1f.s
8 ↗(On Diff #232116)

No disassembly is needed. Just check that --x86-align-branch-boundary=0 and the default (no x86- specific options) have the identical output (cmp %t %t2)

llvm/test/MC/X86/x86-64-align-branch-1g.s
1 ↗(On Diff #232116)

Merge 1e and 1g. State that -mcpu=x86-64 generates 66 90 instead of 90 90 (but why?)

3 ↗(On Diff #232116)

Delete

7 ↗(On Diff #232116)

Delete Disassembly of section .text:

I find another deficiency (infinite loop) with the current approach.

Say, there is a je 0 (0x0F 0x84 0x00 0x00 0x00 0x00) at byte 0x90. (0x90+6)%32 == 0, so it ends on a 32-byte boundary.
MF.getMaxPrefixSize() is 4, so the size of MCMachineDependentFragment may vary from 0 to 4.
If there are other MCMachineDependentFragment's in the section, some may shrink while some may expand.
In some cases the following loop will not converge

bool MCAssembler::layoutOnce(MCAsmLayout &Layout) {
  ++stats::RelaxationSteps;

  bool WasRelaxed = false;
  for (iterator it = begin(), ie = end(); it != ie; ++it) {
    MCSection &Sec = *it;
    while (layoutSectionOnce(Layout, Sec)) ///
      WasRelaxed = true;
  }

  return WasRelaxed;
}

// In MCAssembler::layoutSectionOnce,
  case MCFragment::FT_MachineDependent:
    RelaxedFrag =
        relaxMachineDependent(Layout, *cast<MCMachineDependentFragment>(I));
    break;

To give a concrete example, clang++ -fsanitize=memory compiler-rt/test/msan/cxa_atexit.cpp -mbranches-within-32B-boundaries does not converge. You may also try dtor-*.cpp in that directory.

A simple iterative algorithm is not guaranteed to converge. We probably can solve the layout problem with a dynamic programming algorithm:

f[i][j] = the minimum cost that layouts the first i instructions with j extra bytes (via NOPs or prefixes)
or
g[i][j] = the minimum inserted bytes that layouts the first i instructions with cost j

I am not clear which one is better. A simple greedy approach is to set an upper limit on the number of iterations when the section contains at least one MCMachineDependentFragment, i.e.

bool HasMCMachineDependentFragment = false;
int count = 5; // arbitrary. Please find an appropriate value.
while (layoutSectionOnce(Layout, Sec, HasMCMachineDependentFragment) && count > 0) {
  WasRelaxed = true;
  if (HasMCMachineDependentFragment)
    count--;
}
MaskRay added inline comments.Wed, Dec 4, 5:59 PM
llvm/lib/MC/MCAssembler.cpp
986

Division is slow. Pass in the power of 2 and use right shift instead.

You may change MCMachineDependentFragment::AlignBoundarySize (a power of 2) to a power.

994

Ditto.

skan updated this revision to Diff 232326.Thu, Dec 5, 5:52 AM
skan marked 5 inline comments as done.Thu, Dec 5, 5:56 AM
skan added a comment.Thu, Dec 5, 7:41 AM

I find another deficiency (infinite loop) with the current approach.

Say, there is a je 0 (0x0F 0x84 0x00 0x00 0x00 0x00) at byte 0x90. (0x90+6)%32 == 0, so it ends on a 32-byte boundary.
MF.getMaxPrefixSize() is 4, so the size of MCMachineDependentFragment may vary from 0 to 4.
If there are other MCMachineDependentFragment's in the section, some may shrink while some may expand.
In some cases the following loop will not converge

bool MCAssembler::layoutOnce(MCAsmLayout &Layout) {
  ++stats::RelaxationSteps;

  bool WasRelaxed = false;
  for (iterator it = begin(), ie = end(); it != ie; ++it) {
    MCSection &Sec = *it;
    while (layoutSectionOnce(Layout, Sec)) ///
      WasRelaxed = true;
  }

  return WasRelaxed;
}
 
// In MCAssembler::layoutSectionOnce,
  case MCFragment::FT_MachineDependent:
    RelaxedFrag =
        relaxMachineDependent(Layout, *cast<MCMachineDependentFragment>(I));
    break;

To give a concrete example, clang++ -fsanitize=memory compiler-rt/test/msan/cxa_atexit.cpp -mbranches-within-32B-boundaries does not converge. You may also try dtor-*.cpp in that directory.

A simple iterative algorithm is not guaranteed to converge. We probably can solve the layout problem with a dynamic programming algorithm:

f[i][j] = the minimum cost that layouts the first i instructions with j extra bytes (via NOPs or prefixes)
or
g[i][j] = the minimum inserted bytes that layouts the first i instructions with cost j

I am not clear which one is better. A simple greedy approach is to set an upper limit on the number of iterations when the section contains at least one MCMachineDependentFragment, i.e.

bool HasMCMachineDependentFragment = false;
int count = 5; // arbitrary. Please find an appropriate value.
while (layoutSectionOnce(Layout, Sec, HasMCMachineDependentFragment) && count > 0) {
  WasRelaxed = true;
  if (HasMCMachineDependentFragment)
    count--;
}

I guess you originally meant (90 + 6) % 32 == 0 instead of (0x90 + 6) % 32 == 0. With the previous patch, the command clang++ -fsanitize=memory compiler-rt/test/msan/cxa_atexit.cpp -mbranches-within-32B-boundaries did not converge, but this is not the fault of the simple iterative algorithm.

Let me briefly describe the algorithm. Regardless of whether prefixes or NOPs are filled in, an MCMachineDependentFragment occupies a certain amount of space to align a branch, and it holds a pointer to the branch it is responsible for. Multiple MCMachineDependentFragments can work together to align one branch.

For example, suppose two MCMachineDependentFragments, M1 and M2, align branch J1. In the first iteration, J1 needs 7 bytes of padding, so M1 grows toward 7 as much as possible; however, for some reason M1 can only grow to 5 or smaller, so M2 then grows as much as possible, to 2. In the second iteration, when it is M1's turn to relax, we subtract the sizes of M1 and M2 from the current address of J1 to get the amount of padding J1 needs; assuming it is 6, M1 keeps its size of 5 and M2 shrinks to 1. It is not difficult to see that, among the MCFragments from M1 to J1, as long as the sizes of the fragments other than M1 and M2 become fixed within a limited number of iterations, the sizes of M1 and M2 will also become fixed within a limited number of iterations; that is, the iteration converges.

As far as I know, the MCFragment kinds used to store instructions are MCDataFragment, MCRelaxableFragment, and MCCompactEncodedInstFragment. MCDataFragment and MCCompactEncodedInstFragment have fixed sizes. MCRelaxableFragment stores instructions of variable size and only grows, never shrinks; as a result, its size becomes fixed within a limited number of iterations. So as long as we ensure that only instruction-storing MCFragments appear between the instruction to be prefixed and the branch to be aligned, this iterative algorithm will converge.

The file cxa_atexit.s has more than one text section. The instruction to be prefixed and the branch to be aligned may be in two different text sections. I forgot to check whether there is an MCFragment that does not store instructions between them, which results in non-convergence. This bug has been fixed in the latest patch.

skan marked 3 inline comments as done.Thu, Dec 5, 8:01 AM
skan added inline comments.
llvm/test/MC/X86/x86-64-align-branch-1b.s
10 ↗(On Diff #232116)

Thanks for the knowledge! There is no performance benefit to adding 4 prefixes to the same instruction. The instruction movl %eax, %fs:1 already has a prefix, %fs (0x64); if the option -x86-align-branch-prefix-size=5 is used, we could add at most 4 more prefixes. The branch to be aligned needs 3 bytes of padding, so 3 prefixes are added to the move instruction.
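The budgeting described here can be sketched as follows (prefixes_to_add is a hypothetical helper; the patch's real logic lives in the assembler backend):

```python
def prefixes_to_add(existing_prefixes, padding_needed, max_prefix_size=5):
    """Number of segment-override prefixes that may be added to one
    instruction: the per-instruction budget set by
    -x86-align-branch-prefix-size, minus the prefixes already present,
    further capped by the padding actually needed."""
    budget = max_prefix_size - existing_prefixes
    return min(budget, padding_needed)

# movl %eax, %fs:1 already carries the 0x64 (%fs) prefix, so with a
# budget of 5 at most 4 more fit; here only 3 bytes of padding are
# needed, so 3 prefixes are added.
assert prefixes_to_add(1, 3) == 3
assert prefixes_to_add(1, 6) == 4
```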

llvm/test/MC/X86/x86-64-align-branch-1c.s
2 ↗(On Diff #232116)
# CHECK-NEXT:       45: 2e 89 45 fc                      movl    %eax, %cs:-4(%rbp)
# CHECK-NEXT:       49: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       4c: 89 7d f8                         movl    %edi, -8(%rbp)
# CHECK-NEXT:       4f: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       52: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       55: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       58: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       5b: 89 75 f4                         movl    %esi, -12(%rbp)
# CHECK-NEXT:       5e: 5d                               popq    %rbp
# CHECK-NEXT:       5f: 5d                               popq    %rbp
# CHECK-NEXT:       60: eb 26                            jmp     {{.*}}

The prefix 2e is added at

45: 2e 89 45 fc                      movl    %eax, %cs:-4(%rbp)
llvm/test/MC/X86/x86-64-align-branch-1e.s
46 ↗(On Diff #232116)

Not all target CPUs support long NOPs; you can find the related code in X86AsmBackend::writeNopData.
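For context, the recommended multi-byte NOP encodings come from the Intel Software Developer's Manual; targets without long-NOP support have to fall back to repeated single-byte 0x90. The sketch below illustrates the idea behind writeNopData and is not the actual LLVM implementation:

```python
# Recommended multi-byte NOP encodings (Intel SDM); CPUs lacking
# long-NOP support must use repeated single-byte 0x90 instead.
LONG_NOPS = {
    1: b"\x90",
    2: b"\x66\x90",
    3: b"\x0f\x1f\x00",
    4: b"\x0f\x1f\x40\x00",
    5: b"\x0f\x1f\x44\x00\x00",
    6: b"\x66\x0f\x1f\x44\x00\x00",
    7: b"\x0f\x1f\x80\x00\x00\x00\x00",
    8: b"\x0f\x1f\x84\x00\x00\x00\x00\x00",
    9: b"\x66\x0f\x1f\x84\x00\x00\x00\x00\x00",
}

def nop_pad(n, has_long_nops=True):
    """Emit n bytes of NOP padding, greedily using the longest
    recommended encoding when the target supports it."""
    if not has_long_nops:
        return b"\x90" * n
    out = b""
    while n:
        k = min(n, max(LONG_NOPS))
        out += LONG_NOPS[k]
        n -= k
    return out

assert len(nop_pad(13)) == 13                       # one 9-byte + one 4-byte NOP
assert nop_pad(3, has_long_nops=False) == b"\x90\x90\x90"
```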

FYI: I did close the bug as fixed after verifying that the fix works for me.

reames added a comment.Thu, Dec 5, 2:38 PM

We uncovered another functional issue with this patch, or at least, the interaction of this patch and other parts of LLVM. In our support for STATEPOINT, PATCHPOINT, and STACKMAP we use N-byte nop sequences for regions of code which might be patched out. It's important that these regions are exactly N bytes as concurrent patching which doesn't replace an integral number of instructions is ill-defined on X86-64. This patch causes the N-byte nop sequence to sometimes become (N+M) bytes which breaks the patching. I believe that the XRAY support may have a similar issue.

More generally, I'm worried about the legality of arbitrarily prefixing instructions from unknown sources. In the particular example we saw, we had something along the following:

.Ltmp0:
.p2align 3, 0x90
(16 byte nop sequence)
.Ltmp3:
jmp *%rax

In addition to the patching legality issue above, padding the nop sequence does something else interesting in this example. It changes the alignment of Ltmp3. Before, Ltmp3 was always 8 byte aligned, after prefixes are added, it's not. It's not clear to me exactly what the required semantics here are, but we at least had been assuming the alignment of Ltmp3 was guaranteed in this case. (That's actually how we found the patching issue.)

reames added a comment.Thu, Dec 5, 3:14 PM

I've been digging through the code for this for the last day or so. This is a new area for me, so it's possible I'm off base, but I have some concerns about the current design.

First, there appears to already be support for instruction bundling and alignment in the assembler today. I stumbled across the .bundle_align_mode, .bundle_start, and .bundle_end mechanism (https://lists.llvm.org/pipermail/llvm-dev/2012-December/056723.html) which seems to *heavily* overlap with this proposal. I suspect that the compiler support suggested by James and myself earlier in this thread could be implemented on top of this existing mechanism.

Second, the new callbacks and infrastructure added appear to overlap heavily with the MCCodePadding infrastructure. (Which, admittedly, appears unused and untested.)

Third, I have not seen a justification for why the complexity of instruction prefix padding is necessary. All the affected CPUs support multi-byte NOPs, so we're talking about a *single micro-op* difference between the NOP form and the prefix form. Can anyone point to a performance delta due to this? If not, I'd suggest we should start with the NOP form, and then build the prefix form in a generic manner for all alignment varieties.


My conclusion after looking at all of that was actually that I plan to propose removing both the MCCodePadding and all the bundle-padding infrastructure, not add new stuff on top of it -- the former is unused, and I believe the latter is only for Chrome's NaCl, which is deprecated, and fairly close to being removed. If we need something similar in the future, we should certainly look to both of those for inspiration, but I don't think we need to be constrained by them.

Third, I have not seen a justification for why the complexity of instruction prefix padding is necessary. All the affected CPUs support multi-byte NOPs, so we're talking about a *single micro-op* difference between the NOP form and the prefix form. Can anyone point to a performance delta due to this? If not, I'd suggest we should start with the NOP form, and then build the prefix form in a generic manner for all alignment varieties.

+1.

MaskRay added a subscriber: opaparo.Thu, Dec 5, 3:41 PM


CC the author of D34393 - @opaparo for MCCodePadding. Intel folks may know how to contact @opaparo?

I also noticed that MCCodePadder.cpp was never updated (except for a license change) after the initial check-in.


+1. Starting from just NOP padding sounds like a simple and good first step. We can explore segment override prefixes in the future.

reames added a comment.Thu, Dec 5, 5:43 PM


I can definitely see removing the code padding stuff, since it's unused and untested.

As for the bundle mechanisms, why? It seems like exactly what we're going to want here. Regardless of the auto-detect feature, we're going to need a representation of a bundle which needs to be properly placed to avoid splitting, and the current code does that. Why not reuse the presumably reasonably well-tested existing infrastructure? The only extra thing we seem to need is the ability to toggle off bundle formation for instruction types we don't care about. Since we're going to need an assembler spelling for that regardless, the current code seems like a decent baseline.


I created D71106 to delete MCCodePadder and accompanying classes.


I think it's a good suggestion to start with NOP padding as the first step. In our previous experiment, we saw that prefix padding was slightly better than NOP padding, but not by much. We will retest NOP padding and get back to you.


For whatever it may be worth: Agner Fog's empirical research on x86 pipelines and his review of manufacturer optimization guidelines also conclude that prefixes are often preferable to NOPs on modern x86 processors. (See: https://www.agner.org/optimize/microarchitecture.pdf) This arguably isn't surprising: the decoder has to be good at finding instruction boundaries, prefixes and all, while NOPs of any size still consume decode slots and therefore dilute decode bandwidth.

Recording something so I don't forget it when we get back to the prefix padding version. The write up on the bundle align mode stuff mentions a concerning memory overhead for the feature. Since the basic implementation techniques are similar, we need to make sure we assess the memory overhead of the prefix padding implementation. See https://www.chromium.org/nativeclient/pnacl/aligned-bundling-support-in-llvm for context. I don't believe this is likely to be an issue for the nop padding variant.


From the doc at https://www.chromium.org/nativeclient/pnacl/aligned-bundling-support-in-llvm, the ".bundle_align_mode" directive ensures that each instruction following it does not cross the alignment boundary, and that the first instruction following the directive is aligned; but in this patch we only require that branch (jcc, jmp, ...) instructions do not cross the alignment boundary. Another reminder: this patch also avoids having a branch instruction end exactly on the alignment boundary, and the bundle syntax doesn't support that (see section 2.1 of https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf). So I don't think the bundle syntax fits the current requirement perfectly.

Branch alignment


The primary goal of this patch, restricting the placement of branch instructions, is a performance optimization. Similar to loop alignment, the desire is to increase speed at the cost of code size. However, the way this feature has been designed is as a global assembler flag. I find that not ideal, because it cannot take into account the hotness of a block/function, as for example the loop alignment code does. Basic-block alignment of loops is explicitly opt-in on a block-by-block basis -- the compiler simply emits a p2align directive where it needs one, and the assembler honors that. And so, MachineBlockPlacement::alignBlocks has a bunch of conditions under which it will avoid emitting a p2align. This seems like a good model -- the assembler does what it's told by the compiler (or assembly-writer). Making the branch-instruction alignment work similarly seems like it would be good.

IMO it would be nicest if there could be a directive that requests to specially-align the next instruction. However, the fused-jcc case makes that quite tricky, so perhaps this ought to also be a mode which can be enabled/disabled on a region as well.

Yes, the primary goal of this patch is a performance optimization or mitigation. The intention is to provide a simple method for users to mitigate the performance impact of JCC MCU with less effort. We also provide users several options to tune the performance. But the basic idea is to make it easy for users to mitigate it and improve the performance.

Your proposal is a good idea. But I'm afraid it may not cover all the scenarios. Firstly, the proposal relies on the compiler to detect the hotspots. But the compiler needs LTO and/or PGO to get precise hot spots. Otherwise, if the compiler misses hot spots that impact the application performance, the users have no choice but to insert the directives manually. Secondly, for existing code written in assembly, the compiler can't handle it. The users have to insert the directives by hand, which is quite a lot of work.

I think the current patch wants to give a simple and general solution to mitigate the JCC MCU performance impact for both C/C++ and assembly. And it doesn't need any source code change. I think your proposal would be a good enhancement on top of it.

skan added a comment.Sun, Dec 8, 11:43 PM

We uncovered another functional issue with this patch, or at least, the interaction of this patch and other parts of LLVM. In our support for STATEPOINT, PATCHPOINT, and STACKMAP we use N-byte nop sequences for regions of code which might be patched out. It's important that these regions are exactly N bytes as concurrent patching which doesn't replace an integral number of instructions is ill-defined on X86-64. This patch causes the N-byte nop sequence to sometimes become (N+M) bytes which breaks the patching. I believe that the XRAY support may have a similar issue.

More generally, I'm worried about the legality of arbitrarily prefixing instructions from unknown sources. In the particular example we saw, we had something along the following:

.Ltmp0:

	.p2align	3, 0x90
	(16 byte nop sequence)

.Ltmp3:

	jmp *%rax

In addition to the patching legality issue above, padding the nop sequence does something else interesting in this example. It changes the alignment of Ltmp3. Before, Ltmp3 was always 8 byte aligned, after prefixes are added, it's not. It's not clear to me exactly what the required semantics here are, but we at least had been assuming the alignment of Ltmp3 was guaranteed in this case. (That's actually how we found the patching issue.)

I could not reproduce the phenomenon that an N-byte nop becomes (N+M) bytes with your example. So, according to my understanding, I slightly modified your case. (If my understanding is wrong, I hope you can point it out :-).)

    .text
    nop
.Ltmp0:
    .p2align 3, 0x90
    .rept 16
    nop
    .endr
.Ltmp3:
    movl  %eax, -4(%rsp)
    .rept 2
    nop
    .endr
    jmp .Ltmp0

The instruction jmp .Ltmp0 starts at byte 0x1e and ends at byte 0x20. If we align the jump with a prefix, two prefixes will be added to the .rept 16 / nop / .endr sequence. After the prefixes are added, the 16-byte nop becomes an 18-byte nop, and then the label '.Ltmp3' is no longer 8-byte aligned.

I doubt whether the assumption that '.Ltmp3' is 8-byte aligned is right, since the alignment is not explicitly required. For example, I think we cannot assume the instruction jmp .Ltmp0 always starts at byte 0x1e. And in fact, as long as the binary generated by the compiler does not have a one-to-one correspondence with the assembly code, at least one implicit assumption about an instruction will be broken.
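For reference, the condition the mitigation pads against (per the Intel JCC erratum document: a jump that crosses a 32-byte boundary or whose last byte ends on one) can be expressed with a small hypothetical helper:

```python
def needs_alignment(addr, size, boundary=32):
    """True if an instruction at `addr` of `size` bytes crosses a
    32-byte boundary or ends exactly on one -- the condition the JCC
    erratum mitigation pads against."""
    last = addr + size - 1
    crosses = addr // boundary != last // boundary
    ends_on = (addr + size) % boundary == 0
    return crosses or ends_on

# The `jmp .Ltmp0` above: 2 bytes starting at 0x1e ends exactly on the
# 32-byte boundary at 0x20, so it gets padded.
assert needs_alignment(0x1e, 2)
# After 2 bytes of padding it starts at 0x20, inside one window.
assert not needs_alignment(0x20, 2)
```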

I could not reproduce the phenomenon that an N-byte nop becomes (N+M) bytes with your example. So, according to my understanding, I slightly modified your case. (If my understanding is wrong, I hope you can point it out :-).)

    .text
    nop
.Ltmp0:
    .p2align 3, 0x90
    .rept 16
    nop
    .endr
.Ltmp3:
    movl  %eax, -4(%rsp)

In our case it was

andl $1, %eax

but it does not matter that much.

.rept 2
nop
.endr
jmp .Ltmp0
The instruction `jmp .Ltmp0` starts at byte 0x1e and ends at byte 0x20.

Again, in our particular case start of the sequence was at xxx8, so 8 + 16(our sequence) + 3(andl) + 5(jmp) == 32.

If we align the jump with a prefix, two prefixes will be added to the .rept 16 / nop / .endr sequence. After the prefixes are added, the 16-byte nop becomes an 18-byte nop, and then the label '.Ltmp3' is no longer 8-byte aligned.

Yes, that's what happened.

I doubt whether the assumption that '.Ltmp3' is 8-byte aligned is right, since the alignment is not explicitly required.

The point is that we have an explicit requirement at the start, and we have a lowering into a 16-byte sequence that needs to be preserved exactly as it is.
Essentially, what we need is "protection" for this sequence from any changes by the machinery that generates the binary code.
How can we protect a particular byte sequence from being changed by this branch aligner?

annita.zhang added a comment.EditedMon, Dec 9, 5:14 AM

The point is that we have an explicit requirement at the start, and we have a lowering into a 16-byte sequence that needs to be preserved exactly as it is.
Essentially, what we need is "protection" for this sequence from any changes by the machinery that generates the binary code.
How can we protect a particular byte sequence from being changed by this branch aligner?

No, in general we can't. The current solution relies on the assembler inserting a prefix or NOP before branches that cross (or end on) a boundary. It can only ensure the explicit alignment specified by a directive, not any implicit alignment. I don't think any assembler-based fixup can do that. On the other hand, any code sequence after an alignment directive, or even just inside a function, has some kind of implicit alignment. It's hard for the assembler to tell which implicit alignment to preserve. The preferred way is to use an explicit alignment directive to specify it.

For your scenario, NOP padding is more controllable. NOP padding will be inserted just before the branch instructions (or macro-fused branch instructions). So if there are no branches (or macro-fused branches) in your code sequence, no NOP will be inserted.


What if I insert explicit align(8) right *after* the sequence?

reames added a comment.Mon, Dec 9, 5:17 PM

I just posted an alternate review (https://reviews.llvm.org/D71238) which attempts to carve out a minimum reviewable piece of complexity. The hope is that we can review that one quickly (as there are fewer interacting concerns), and then rebase this one (possibly splitting further).

I had previously suggested in review comments that we should reuse the infrastructure from .bundle_align_mode. When I sat down to actually implement that, I discovered that the code for that has a bunch of interacting assumptions about when fragments are constructed and used vs alignment boundaries. I got a version of this working, but the complexity was worrisome. I now suggest that we should take the rough approach sketched here (a separate fragment before the one being aligned), delete the essentially unused bundle mode code, and revisit a unified representation if needed for memory density at a later time. (i.e. my previous suggestion wasn't a good one)

skan added a comment.Mon, Dec 9, 5:27 PM


What if I insert explicit align(8) right *after* the sequence?

If you insert an explicit .align 8 after the sequence, and the sequence doesn't have any branch to be aligned, the current solution won't change the sequence.

skan updated this revision to Diff 233008.Mon, Dec 9, 11:48 PM


Well, I kind of figured that from the code's current behavior, but is it really by design, or does it just happen to work now?
Seeing that the assembler is becoming very intelligent, I would rather have a strict guarantee, similar to the "hardcode" thing you have here, that protects my sequence,
rather than relying on the fact that, under the current settings, moving my label by 8 does not appear profitable to the assembler.

skan added a comment.Wed, Dec 11, 10:40 PM


Well, I kind of figured that from the code's current behavior, but is it really by design, or does it just happen to work now?

It is by design: a prefix won't be added to an instruction if there is an align directive in the path to the target branch.

Seeing that the assembler is becoming very intelligent, I would rather have a strict guarantee, similar to the "hardcode" thing you have here, that protects my sequence,
rather than relying on the fact that, under the current settings, moving my label by 8 does not appear profitable to the assembler.

It depends on what your real need is. If you just need your original implicit alignment to work, turning it into an explicit alignment is enough. Or, if you need a sequence that is not changed by
the assembler at all, maybe we need a new directive such as ".hard" or ".bundle".

skan updated this revision to Diff 233545.Thu, Dec 12, 2:45 AM
  1. Replace divide and mod operations with bitwise operations
  2. Refine help information
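The first item relies on the standard power-of-two identities x / 32 == x >> 5 and x % 32 == x & 31. A hypothetical before/after sketch of the boundary-cross check (the boundary must be a power of two for the bitwise form to be valid):

```python
BOUNDARY = 32  # must be a power of two for the bitwise form to hold

def crosses_boundary_div(addr, size):
    # Straightforward form using division.
    return addr // BOUNDARY != (addr + size - 1) // BOUNDARY

def crosses_boundary_bitwise(addr, size):
    # Masking off the low bits gives the aligned-down address, so
    # comparing masked addresses is equivalent to comparing quotients.
    mask = ~(BOUNDARY - 1)
    return (addr & mask) != ((addr + size - 1) & mask)

# The two forms agree on every (address, size) pair.
for addr in range(64):
    for size in range(1, 16):
        assert crosses_boundary_div(addr, size) == crosses_boundary_bitwise(addr, size)
```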